WordSmith Tools Manual

Flavours of Unicode

What is Unicode?

For many languages (Russian, Japanese, Greek, Vietnamese, Arabic etc.) WordSmith requires Unicode, technically UTF-16 Unicode, little-endian. This format uses 2 bytes for nearly every character in everyday use (a small number of rarer characters take 4). One byte is not enough space to record complex characters, though it works for the English alphabet and some simple punctuation and number characters.

UTF-8, a format devised some years ago when disk space was limited and character encoding was problematic, is in widespread use but is generally not suitable for WordSmith's purposes. That's because it uses a variable number of bytes to represent different characters: A to Z take only 1 byte each, but Japanese characters, for example, usually need 3 bytes each, and some characters need 4.
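The difference in storage can be checked with a minimal Python sketch. It is only an illustration and has nothing to do with WordSmith's own code; the sample characters and the helper name `byte_lengths` are assumptions chosen for the example.

```python
# Illustrative sketch only: compare how many bytes one character occupies
# in UTF-8 and in UTF-16 little-endian. The samples are arbitrary choices.
samples = ["A", "é", "Ж", "α", "日"]  # Latin, accented Latin, Cyrillic, Greek, Japanese


def byte_lengths(ch):
    """Return (UTF-8 length, UTF-16 LE length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in samples:
    u8, u16 = byte_lengths(ch)
    # Print the code point rather than the character itself, so the output
    # is safe on consoles that cannot display non-English characters.
    print(f"U+{ord(ch):04X}  UTF-8: {u8} byte(s)  UTF-16 LE: {u16} byte(s)")
```

For these samples the UTF-8 length varies from 1 to 3 bytes, while the UTF-16 little-endian length is 2 bytes throughout.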

There are a number of different "flavours" of Unicode as defined by the Unicode Consortium.

MS Word offers:

•Unicode

•Unicode (Big-Endian) (generated by some Mac or Unix software)

•Unicode (UTF-7)

•Unicode (UTF-8)

The last two are built from 1-byte units and use a variable number of bytes per character, so they are not really Unicode in my opinion. WordSmith wants the first of these (which is UTF-16, little-endian) but should automatically convert from any of the others. If you are converting text yourself, prefer Unicode (little-endian), UTF-16.
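If you want to convert text outside WordSmith, the following is a minimal Python sketch of one way to do it. It is not how WordSmith itself converts files. The function name `to_utf16_le` is hypothetical, and two things are assumptions: that a file without a byte-order mark is UTF-8, and that the output should start with a byte-order mark. UTF-7 and UTF-32 are not handled.

```python
import codecs


def to_utf16_le(raw: bytes) -> bytes:
    """Decode raw bytes according to their byte-order mark (BOM), then
    re-encode them as UTF-16 little-endian with a leading BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    else:
        # Assumption: a file with no BOM is UTF-8 (plain ASCII is a subset).
        text = raw.decode("utf-8")
    # Assumption: the receiving software expects a BOM at the start.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


# Usage: a big-endian file containing the Cyrillic letter Zhe (U+0416).
big_endian = codecs.BOM_UTF16_BE + "\u0416".encode("utf-16-be")
converted = to_utf16_le(big_endian)
```

To convert a file on disk, read it with `open(path, "rb").read()`, pass the bytes to the function, and write the result to a new file opened in `"wb"` mode.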

Technical Note

There are other flavours too and there is much more complexity to this topic than can be explained here, but essentially what we are trying to achieve is a system where a character can be stored in the PC in a fixed amount of space and displayed correctly.

Precomposed

In a few cases in certain languages, some of your texts may have been prepared with a character followed by a separate accent, such as A followed by ^, with the intention that the software displays the two merged (Â). The alternative is a precomposed character, where the two are already merged into a single character in the text file. See the explanation in Advanced Settings if you need to handle that situation.
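The distinction can be seen in a short Python sketch using the standard `unicodedata` module. This only illustrates the two forms and is not a description of what the Advanced Settings option does. It assumes the accent is stored as a Unicode combining character (here U+0302, the combining circumflex); a plain keyboard ^ (U+005E) is a different character and would not be merged this way.

```python
import unicodedata

# A followed by a combining circumflex: two characters in the file,
# displayed as one merged glyph by software that supports combining marks.
decomposed = "A\u0302"

# NFC normalisation merges the pair into the single precomposed character
# U+00C2 (capital A with circumflex) where one exists.
precomposed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), [f"U+{ord(c):04X}" for c in decomposed])
print(len(precomposed), [f"U+{ord(c):04X}" for c in precomposed])
```

The two strings look the same on screen but differ in length, which matters for anything that counts or compares characters.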