Serious Character Unicode Input bugs in Windows Word

We posted this report a few days ago as a reply to an older thread "Combining diacritics positioned incorrectly in Word 2016". It received some views but no actual response. Because we believe these are serious long-standing bugs that are in need of urgent attention we have posted it again this time as a new thread. We also tried posting to Word's "Feedback" option from within the program itself but there is a very low word limit on those postings so all we could do was ask Microsoft to reply to us with an address to send the full report. We are yet to receive any response. If someone can tell us a better, more direct way than this Community forum to get this report to the relevant teams at Microsoft we would be extremely grateful.

In Microsoft Word for Windows 2016 there are three ways to insert Unicode encoded characters that are not accessible from the standard inbuilt keyboard (ie non ASCII). We have found significant flaws in two out of these three methods. They appear to be bugs.
___________________________________________________________

(A)  INSERT SYMBOL - generally works fine. No problem.
___________________________________________________________

(B)  DIRECT UNICODE INPUT USING XXXX[Alt+X] METHOD - we have discovered serious, consistent, easily-reproduced flaws in this method. This input method is documented in Word Help and is referred to as "Shortcut key" at the bottom of the Insert/Symbol dialog box. It is limited to Word (it doesn't work in Publisher or PowerPoint).

There are indications pointing to the problem on a number of internet forum posts eg

https://www.howtogeek.com/239321/how-to-manually-create-compound-characters-in-word/
          (see the paragraph near the bottom which starts with "There is a situation where this second method doesn’t work...")

https://qualityandinnovation.com/2014/11/22/typing-x-bar-y-bar-p-hat-q-hat-and-all-that/
          (see the very last post in the last paragraph beginning "I had mixed results simply using the “0305 Alt-x” shortcut")

We have exhaustively analyzed this problem and believe we have fully identified the issue:

SUMMARY OF THE PROBLEM:

If the last character typed before the 4-character hexadecimal unicode is 0-9, a-f or A-F (in other words a character that COULD be part of a valid hexadecimal unicode) then Alt+X incorrectly reads back 5 characters instead of 4 and the result in most fonts is invariably an undefined character. If the font happened to contain that 5-character unicode then that character is what would be displayed. Furthermore Word RETAINS the 5-character sequence and will regenerate it as a 5-character unicode (instead of the starting letter plus 4-character unicode) if you press Alt+X again. If the starting letter was lowercase a-f then Alt+X will regenerate it as uppercase because it has read the first letter as part of a 5-character unicode. Further, there is one other letter outside the a-f range that causes the unicode input method to fail - typing the unicode after x or X and pressing Alt+X produces no result and the unicode is not converted.

REPRODUCIBLE TEST:

In any font that contains Combining Diacritics (eg Calibri, Arial etc etc):

Type a0300 then Alt+X. You'll get the "undefined character" box (may contain question mark) because the Alt+X converter has included the "a" in its back-reading of the unicode value. The font does not contain the character with unicode A0300, hence the undefined character box.

(Note that typing a0300 Alt+X SHOULD result in a lowercase a with Combining Grave (U+0300) on top.)

Press Alt+X again and it's converted back to the "unicode" A0300 (with uppercase A, confirming that the original "a" has been wrongly interpreted as part of a unicode).

Typing uppercase A  followed by 0300 then Alt+X will fail similarly.

So will typing the number 9 followed by 0300 then Alt+X.

All indications are that the above test will fail in exactly the same way whenever the first character is a-f, A-F or 0-9 no matter what 4-character unicode is entered afterwards (not just 0300!)

BUT if you type g0300 then Alt+X, you'll get the correct output, ie a letter g with grave on top.

... and so on, right through to z EXCEPT for x and X.

(Type x0300 or X0300 then Alt+X and NOTHING happens - the unicode 0300 remains unconverted. Same result with all unicodes typed after x or X).
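
To make the difference between the two readings of a0300 concrete, here is a small Python sketch (Python is used only for illustration; it has nothing to do with Word's own code). It compares the sequence the typist expects with the single code point that the 5-character reading produces, using the standard unicodedata module. The helper name describe is made up for this example.

```python
import unicodedata

# What the typist expects from a0300 + Alt+X: letter a, then U+0300.
intended = "a" + chr(0x0300)
# What the 5-character reading produces: the single code point U+A0300.
observed = chr(0xA0300)

def describe(s):
    """Return (code point label, Unicode general category) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c)) for c in s]

print(describe(intended))  # a letter (Ll) followed by a combining mark (Mn)
print(describe(observed))  # one code point in category Cn (unassigned)

# The intended sequence is canonically equivalent to precomposed U+00E0.
print(unicodedata.normalize("NFC", intended) == "\u00e0")
```

Category Cn means the code point is not assigned to any character, which is consistent with the "undefined character" box seen in the test.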

WHEN WOULD THIS BE A PROBLEM?

1. Trying to input a unicode letter (or a diacritic on top of a letter) WITHIN a word eg
   The German preposition "für" ("for") can be typed two ways - one will fail and one will work:
   (a) f00FC[Alt+X]r will FAIL - the "f" is read by the Alt+X converter as part of a 5-character unicode value.
   (b) fu0308[Alt+X]r will SUCCEED - Alt+X reads back only 4 characters because "u" could not form part of a unicode.

2. 5-character unicodes have existed for quite a few years so a further ambiguity has consequently been added into this already error-prone process of getting Alt+X to read backwards. How does it know whether to read 4 characters back or 5? The terminal boundary of the inputted unicode is unambiguous - it occurs at the point Alt+X is typed. But the beginning of the inputted unicode is currently ambiguous. If you just always read back 4 characters that will solve the above issue BUT it will rule out the input of a 5-character unicode. BOTH the beginning AND end boundaries of the inputted unicode need to be unambiguous - as with the Alt(hold)+XXXX(release) method or Mac's Unicode Hex input keyboard.
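
The ambiguity described in the two points above can be sketched in a few lines of Python. This is only an illustration of the problem, not Word's implementation: the function name candidate_readings and the choice to consider readings of 4, 5 or 6 hex digits are assumptions made for the example.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
MAX_CODE_POINT = 0x10FFFF  # highest code point defined by Unicode

def candidate_readings(text_before_cursor, lengths=(4, 5, 6)):
    """List each (remaining_text, code_point) reading of the trailing hex digits."""
    readings = []
    for n in lengths:
        if n > len(text_before_cursor):
            break
        tail = text_before_cursor[-n:]
        if not all(ch in HEX_DIGITS for ch in tail):
            break  # a non-hex character rules out this and every longer reading
        value = int(tail, 16)
        if value <= MAX_CODE_POINT:
            readings.append((text_before_cursor[:-n], value))
    return readings

for typed in ("a0300", "g0300", "fu0308"):
    print(typed, [(rest, f"U+{cp:04X}") for rest, cp in candidate_readings(typed)])
```

Under these assumptions a0300 has two possible readings (a + U+0300, or U+A0300 alone) while g0300 and fu0308 each have exactly one, which is why only an explicit start boundary removes the guesswork.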
___________________________________________________________

(C)  DIRECT UNICODE INPUT FROM CUSTOM SOFTWARE KEYBOARD

This issue was discovered during our use of a custom keyboard for inputting combining diacritics to easily create a large library of Sanskrit transliteration characters. The issue appears to be related to a misinterpretation of certain unicode keyboard inputs by Word's Unicode keyboard processing engine.

The same issue occurs in Word, Publisher and PowerPoint.

We have a custom keyboard in which 11 diacritics from the Unicode Combining Diacritics range are encoded with their correct unicodes: 0300, 0301, 0303, 0304, 0306, 0307, 030D, 030E, 0310, 0323, 0331. They are accessed using the AltGr key in Caps Lock mode (equivalent to Shift+AltGr mode).
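
For reference, the 11 code points can be listed with their official Unicode character names using Python's standard unicodedata module. This is just a lookup sketch for readers who want to check the list; the names DIACRITICS and diacritic_table are made up for the example.

```python
import unicodedata

# The 11 combining diacritics encoded on the custom keyboard.
DIACRITICS = [0x0300, 0x0301, 0x0303, 0x0304, 0x0306, 0x0307,
              0x030D, 0x030E, 0x0310, 0x0323, 0x0331]

def diacritic_table(code_points):
    """Return (label, Unicode name, canonical combining class) for each code point."""
    return [
        (f"U+{cp:04X}", unicodedata.name(chr(cp)), unicodedata.combining(chr(cp)))
        for cp in code_points
    ]

for label, name, combining_class in diacritic_table(DIACRITICS):
    print(label, name, combining_class)
```

A non-zero combining class confirms that each entry is a combining mark that attaches to the preceding base character.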

The custom keyboard input method works perfectly for ALL 11 diacritics after ALL letters a-z, A-Z in Microsoft's own NotePad and WordPad and also in CorelDraw X9. The output in all situations - including entering the diacritics by themselves (ie no preceding letter) - is 100% correct and robust.

Our saved NotePad and WordPad documents can be opened in Word and all 11 diacritics over all letters display correctly in all combinations. Also, the raw text copied from the NotePad and WordPad documents can be pasted directly into Word and again all diacritics on all letters display correctly.

In Mac Word the corresponding custom keyboard input method works fine for ALL 11 diacritics after ALL letters. This Mac Word document containing all the letter+diacritic combinations opens and displays perfectly when opened in Windows Word.

In Windows Word, ALL 11 diacritics can be typed successfully after all vowels (a e i o u) and also after the consonant/proto-vowel y.

However, 4 particular diacritics 0300, 0301, 0303 and 0323 CANNOT BE SUCCESSFULLY TYPED BY KEYBOARD AFTER A CONSONANT (except for y) OR BY THEMSELVES. The input method fails and no character is displayed. The same result occurs in many fonts including Calibri and Arial.

It was initially suspected that the issue may be related to glyph naming, because these 4 diacritics happen to be the only ones out of our 11 which have legacy glyph names (gravecomb, acutecomb, tildecomb, dotbelowcomb) from the Adobe Glyph List (v1.7) rather than generic uniXXXX names. These non-uniXXXX names are quite standard across the majority of fonts. To test this theory, a custom font was created in which those 4 glyphs were assigned generic uniXXXX format names instead of the recommended Adobe Glyph List names, ie they were renamed uni0300, uni0301, uni0303 and uni0323 respectively. This made NO DIFFERENCE to the behaviour in Word. However, it seems too much of a coincidence for this glyph name anomaly not to be related to the issue, perhaps in some subtle way.

It is not known how many other characters this problem occurs with.

We can think of no valid typographical or linguistic reason for this behaviour and, in any case, we know it works fine if done from within the Insert Symbol window. From our testing in various versions of Word, it appears this issue has been part of Microsoft Word's code at least as far back as Word 2003, possibly earlier.

REPRODUCIBLE TEST:

1. Set up a custom keyboard to input the 11 Combining Diacritics listed above: 0300, 0301, 0303, 0304, 0306, 0307, 030D, 030E, 0310, 0323, 0331. To investigate the possible relevance of the glyph names further we suggest also including 0309 (hookabovecomb), which is the only other combining diacritic with a non-uniXXXX format name.

2. Using any font that contains Combining Diacritics (eg Calibri, Arial etc etc), type any vowel a, e, i, o, u, or y and after it, in turn, type the 11 diacritics. You will see that they ALL display.

3. Now, with NO preceding letter type the 11 diacritics on their own - only 0300, 0301, 0303 and 0323 don't display. (Also check 0309 if included).

4. Now type any consonant (except y) and after it, in turn, type the 11 diacritics - only 0300, 0301, 0303 and 0323 don't display. (Also check 0309 if included).

5. Do the same process in NotePad and WordPad. There will be no display problems. Copy the NotePad and WordPad text into Windows Word - no display problems. Open the saved NotePad and WordPad documents in Windows Word - no display problems.
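
For anyone who wants a known-good reference text to compare against what the keyboard produces in Word, the full set of letter+diacritic combinations can be generated with a short Python sketch like the one below. It does not reproduce the keyboard fault itself (that requires typing); it only builds the comparison text, much like the NotePad text in step 5. The function names and the output file name are made up for the example.

```python
import string

# The 11 combining diacritics from step 1.
DIACRITICS = [0x0300, 0x0301, 0x0303, 0x0304, 0x0306, 0x0307,
              0x030D, 0x030E, 0x0310, 0x0323, 0x0331]

def build_test_lines(letters=string.ascii_lowercase, diacritics=DIACRITICS):
    """Return one line per letter, each holding that letter with every diacritic."""
    lines = []
    for letter in letters:
        cells = [letter + chr(cp) for cp in diacritics]
        lines.append(" ".join(cells))
    return lines

def write_matrix(path="diacritic_matrix.txt"):
    """Save the matrix as UTF-8 text that can be opened or pasted into Word."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(build_test_lines()) + "\n")
```

Calling write_matrix() produces a text file with 26 lines of 11 combinations each; passing string.ascii_uppercase to build_test_lines gives the uppercase set.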

___________________________________________________________

We regard issue (C) and also the preceding one (B) as extremely serious and urgent and we would be surprised if Microsoft doesn't feel the same way. These presumed bugs are affecting the functionality of the fonts and keyboard in a project that we have been involved with for 6 years which is awaiting release pending the resolution of these issues. We can't imagine that Microsoft would want these bugs to persist any longer now that they have been brought to their attention. We are prepared to cooperate fully (testing, feedback etc) with Microsoft to reach a speedy resolution and look forward to hearing from you on both these issues urgently.

**********************************************************
Kevin Brown's  G R A P H I T Y !  Est.1979
DIGITAL TYPE SPECIALIST  *  GRAPHIC DESIGN

Member: The Unicode Consortium
Member: Australian Graphic Design Association

www.australianschoolfonts.com.au
**********************************************************

[Moved from:  Office / Word / Windows 10 / Office 2016]

 
Last updated June 4, 2018 Views 514 Applies to:
Answer
Hi Chris, Grateful thanks to you and your colleague who's handling issue (C) for your prompt attention to these issues. I completely accept your co-worker's qualification that the fix for this longstanding issue will require careful testing etc to avoid unintended consequences. Assuming the fix goes ahead as planned in the Click-to-Run version is there some way that we can be informed that the fix is in place? This would greatly assist with the roll-out of the custom keyboard mentioned in our report. Cheers, Kevin.

Answer
Hi Kevin,

I'm Chris, a Word engineer.  Thank you for giving us feedback and especially for your detailed observations.  I wanted to let you know your post has come to our attention and we have two engineers looking into your issue; I'm looking at issue (B) and another is looking at issue (C).  I don't know much about the investigation into issue (C), besides that we're looking into it.

I wanted to talk to you about issue B - first about more potential workarounds, and second about the design of this feature.  There are two methods you can use to avoid ambiguity when pressing Alt-X.  One, precede the Unicode hex with a "U+".  So for your a0300 example, type "aU+0300" and press Alt-X, and you'll get the correct result.  Likewise "aU+a0300" will result in an "a" plus the undefined character box, if that's your intent.

Two, Alt-X will also work on a selection.  So if you type "a0300", select "0300", and press Alt-X, you'll get your a with a Combining Grave on top.  If you actually want to convert the entire "a0300" string into a unicode character, include the "a" in your selection.

So with these two workarounds, I want to tell you that I think the behavior you're seeing is the intended design.  If you Alt-X after "a0300", we don't know if you want the U+A0300 character code or the A with U+0300 character code.  Both are perfectly valid Unicode, despite the former not having a usable glyph in common fonts.  We simply continue picking up previous characters until we can no longer make a Unicode value from the result.  We're not even counting to 4 or 5 characters to get a character code; "f12" + alt-x works, as well as "A1" or even "1".  It's just that with your examples, Word has no easy way of determining the user's intent, so we added the workarounds I mentioned above to assist.  There are certainly hard ways, like making educated guesses based on language, font, and document context, but I don't see how they'll be more than guesses.
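
As a rough illustration of the back-reading described above, here is a small Python model. It is only a sketch based on the description in this reply, not Word's actual code: the function name alt_x, the cap at U+10FFFF as the test for "can still make a Unicode value", and the handling of the "U+" prefix are assumptions, and the separate "x" notation is not modelled.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
MAX_CODE_POINT = 0x10FFFF  # assumed limit for a valid Unicode value

def alt_x(text):
    """Model pressing Alt+X at the end of `text`; return the converted text."""
    start = len(text)
    # Pick up previous characters while they still form a valid code point.
    while start > 0 and text[start - 1] in HEX_DIGITS:
        if int(text[start - 1:], 16) > MAX_CODE_POINT:
            break
        start -= 1
    if start == len(text):
        return text  # no trailing hex digits, nothing to convert
    code_point = int(text[start:], 16)
    # An explicit "U+" prefix marks where the code starts and is consumed.
    keep = start - 2 if text[:start].upper().endswith("U+") else start
    return text[:keep] + chr(code_point)

for typed in ("a0300", "g0300", "aU+0300", "A1"):
    result = alt_x(typed)
    print(typed, "->", [f"U+{ord(c):04X}" for c in result])
```

In this model "a0300" becomes the single code point U+A0300, while "g0300" and "aU+0300" both end with U+0300 attached to the preceding letter, matching the behaviour and the first workaround described above.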

Finally, I looked into the "x0300" case and I discovered that the "x" notation is designed to be used for ASCII characters.  So "x20" + alt-x converts the string into a space, and "xe9" + alt-x gives you an accented e.

I hope these workarounds resolve your concerns about Alt-X.  If not, or if you have an idea on how we can improve the design, please let me know.

Thanks,
Chris
