
WordSmith Tools Help


Flavours of Unicode


What is Unicode?

 

What WordSmith requires for many languages (Russian, Japanese, Greek, Vietnamese, Arabic etc.) is Unicode, technically UTF-16 Unicode, little-endian. UTF-16 uses 2 bytes for each of the commonly used characters; rarer characters take 4 bytes, stored as a pair of 2-byte units. One byte is not enough space to record the characters of these languages, though it is enough for the English alphabet and some simple punctuation and number characters.
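As a rough illustration of the storage sizes involved, the following Python sketch counts the bytes UTF-16 little-endian needs for a few sample characters. The function name `utf16le_byte_count` and the sample characters are chosen here for illustration only; this is not part of WordSmith.

```python
def utf16le_byte_count(text: str) -> int:
    """Return the number of bytes needed to store text as UTF-16 little-endian (no BOM)."""
    return len(text.encode("utf-16-le"))


# Letters from English, Russian, Greek, Arabic and Japanese: 2 bytes each.
for ch in ["A", "Я", "λ", "ع", "日"]:
    print(ch, utf16le_byte_count(ch))

# A character outside the commonly used range (here an emoji) is stored
# as a pair of 2-byte units, so it takes 4 bytes.
print("😀", utf16le_byte_count("😀"))

# A single-byte encoding such as Latin-1 cannot represent Japanese at all.
try:
    "日".encode("latin-1")
except UnicodeEncodeError:
    print("日 cannot be stored in a 1-byte encoding")
```

The point of the sketch is that every character in the listed languages occupies the same 2 bytes, which is what makes positions in the text easy to compute.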

 

UTF-8, a format devised for many languages some years ago when disk space was limited and character encoding was problematic, is in widespread use but is generally not suitable for WordSmith. That's because it uses a variable number of bytes to represent the different characters. A to Z take only 1 byte each, but Greek or Russian letters take 2, most Japanese characters take 3, and some characters take 4.

 

There are a number of different "flavours" of Unicode as defined by the Unicode Consortium.

MS Word offers:

Unicode

Unicode (Big-Endian) (generated by some Mac or Unix software)

Unicode (UTF-7)

Unicode (UTF-8)

 

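The last two are built from 1-byte units and use a variable number of them for each character. WordSmith wants the first of these (plain "Unicode", which is UTF-16 little-endian) but should automatically convert from any of the others. If you are converting text yourself, prefer Unicode (UTF-16, little-endian).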
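If you want to convert text yourself before loading it, a minimal Python sketch might look like the one below. It is not WordSmith's own conversion routine. It assumes the input either starts with a byte-order mark (BOM) or is UTF-8, and the function names `decode_with_bom` and `to_utf16le_with_bom` are invented for this example.

```python
import codecs


def decode_with_bom(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw bytes, using a leading BOM to pick the encoding if one is present.

    Limitation: a UTF-32 little-endian BOM begins with the same two bytes as the
    UTF-16 little-endian BOM, so UTF-32 input is not handled by this sketch.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: assume the fallback encoding (an assumption, not a detection).
    return raw.decode(fallback)


def to_utf16le_with_bom(text: str) -> bytes:
    """Encode text as UTF-16 little-endian, preceded by the little-endian BOM."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


# Example: big-endian input (as produced by some Mac or Unix software)
# is re-encoded as little-endian UTF-16.
big_endian = codecs.BOM_UTF16_BE + "日本語".encode("utf-16-be")
converted = to_utf16le_with_bom(decode_with_bom(big_endian))
```

To convert a file, read it in binary mode, pass the bytes through the two functions, and write the result back in binary mode. Text with no BOM that is not UTF-8 needs its encoding supplied through the `fallback` parameter.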

 

Technical Note

There are other flavours too and there is much more complexity to this topic than can be explained here, but essentially what we are trying to achieve is a system where a character can be stored in the PC in a fixed amount of space and displayed correctly.