Convert Text File Format

To convert a series of whole text files from one format to another, choose between these options:




These options convert your texts into formats suited to text processing. (UTF-8, a format which was devised for many languages some years ago when disk space was limited and character encoding was problematic, is generally not suitable. That's because it uses a variable number of bytes, from 1 to 4, to represent the different characters. A to Z need only 1 byte each, but a Japanese character, for example, typically needs 3 bytes.)
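
To make the point about variable width concrete, here is a minimal Python sketch (Python is used only for illustration; the helper name `byte_lengths` is hypothetical and not part of Text Converter). It counts how many bytes one character occupies in UTF-8 and in UTF-16.

```python
def byte_lengths(ch: str):
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    # "utf-16-le" is used so that no byte-order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# A plain letter, an accented letter (U+00E9) and a Japanese character (U+65E5).
for ch in ["A", "\u00e9", "\u65e5"]:
    utf8_len, utf16_len = byte_lengths(ch)
    print(repr(ch), "UTF-8:", utf8_len, "UTF-16:", utf16_len)
```

The UTF-8 count changes from character to character, whereas the UTF-16 count stays at 2 for all three of these characters.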


DOS to Windows:

... choose the "codepage" that your old DOS texts were encoded with, e.g. DOS 850 Multilingual.
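
As a rough illustration of what such a conversion involves, the sketch below decodes bytes from a DOS codepage and re-encodes them for Windows. It assumes the Windows target is the Western European codepage 1252, which the text above does not specify, and the function name `dos_to_windows` is hypothetical.

```python
def dos_to_windows(raw: bytes, dos_codepage: str = "cp850") -> bytes:
    """Re-encode DOS-codepage bytes as Windows-1252 bytes.

    Assumption: the target is cp1252; pick another codepage if your
    Windows texts use a different one.
    """
    text = raw.decode(dos_codepage)
    # Characters with no cp1252 equivalent (e.g. box-drawing symbols)
    # are replaced by "?" rather than stopping the conversion.
    return text.encode("cp1252", errors="replace")


# In DOS 850 the byte 0x82 is e-acute; in Windows-1252 it is 0xE9.
print(dos_to_windows(b"caf\x82"))
```

Choosing the wrong source codepage does not raise an error here; it simply yields the wrong accented characters, which is why the codepage choice matters.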


Unix to Windows:

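... Unix-saved texts don't use the same codes for end-of-paragraph as Windows-saved ones: Unix ends each line with a line feed (LF) alone, whereas Windows uses a carriage return followed by a line feed (CR+LF).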
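
A minimal sketch of the line-ending change, assuming the text is already held as a string; the function name `unix_to_windows` is hypothetical.

```python
def unix_to_windows(text: str) -> str:
    """Turn Unix line endings (LF) into Windows ones (CR+LF)."""
    # Normalise any existing CR+LF pairs first so they are not doubled.
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


print(repr(unix_to_windows("first paragraph\nsecond paragraph\n")))
```

When reading and writing files for this in Python, open them with `newline=""` so that Python does not translate the line endings itself.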


into Unicode:

... this is a better standard than ANSI as it allows many more characters to be used, suiting lots of languages. This is UTF-16 Unicode, which uses 2 bytes for nearly every character (rare characters outside the Basic Multilingual Plane take 4).
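
The sketch below shows one way such a conversion could look. It assumes the source text is in Windows-1252 and that the output should be little-endian UTF-16 with a byte-order mark; neither detail is stated above, and `to_utf16` is a hypothetical name.

```python
import codecs


def to_utf16(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode bytes from an ANSI codepage as UTF-16 (little-endian)."""
    text = raw.decode(source_encoding)
    # Prefix a byte-order mark so that readers can detect the encoding.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


converted = to_utf16(b"caf\xe9")
print(len(converted))  # 2 bytes of BOM plus 2 bytes per character
```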


from MS Word .doc

... like using "Save as Text" in Word.


HTML/BNC entities to characters

... converts symbols which are hard to read such as &eacute; to ones like é


from column tagged, using <> except column

... The Stuttgart Tree Tagger produces output like this:


word          pos          lemma
The           DT           the
TreeTagger    NP           TreeTagger
is            VBZ          be
easy          JJ           easy
to            TO           to
use           VB           use
.             SENT         .


If you set the column to 1, Text Converter will convert this to


The<DT><the> TreeTagger<NP><TreeTagger> is<VBZ><be> easy<JJ><easy> to<TO><to> use<VB><use> .<SENT><.>
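
The conversion above can be sketched roughly as follows. This is an illustration under stated assumptions, not Text Converter's own code: columns are taken to be separated by whitespace, any header row is assumed to have been removed already, and the name `columns_to_tags` and its parameter `plain_column` are hypothetical.

```python
def columns_to_tags(lines, plain_column=1):
    """Join column-tagged lines into running text.

    Every column is wrapped in <> except the one numbered
    plain_column (counting from 1), which is left as plain text.
    """
    tokens = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        parts = [
            field if number == plain_column else f"<{field}>"
            for number, field in enumerate(fields, start=1)
        ]
        tokens.append("".join(parts))
    return " ".join(tokens)


tagged = ["The\tDT\tthe", "TreeTagger\tNP\tTreeTagger", "is\tVBZ\tbe"]
print(columns_to_tags(tagged, plain_column=1))
```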


Lemmatised using ...

... converts each file using a lemma file. If your source text has "she was tired" and your lemma file has BE -> AM, WAS, WERE, IS, ARE, then you will get "she be tired" in your converted text file. Where your source text has "Was she tired?" you'll get "Be she tired?"
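
A minimal sketch of this kind of lemma substitution follows. It assumes each lemma-file line has the form `HEADWORD -> FORM, FORM, ...` as in the example above, and that an initial capital in the source word is carried over to the lemma; the function names are hypothetical and Text Converter's actual matching rules may differ.

```python
import re


def build_lookup(lemma_lines):
    """Map each inflected form (lower-cased) to its headword."""
    lookup = {}
    for line in lemma_lines:
        if "->" not in line:
            continue  # ignore lines that are not lemma entries
        head, forms = line.split("->", 1)
        for form in forms.split(","):
            if form.strip():
                lookup[form.strip().lower()] = head.strip().lower()
    return lookup


def lemmatise(text, lookup):
    """Replace every listed word form in text with its headword."""
    def swap(match):
        word = match.group(0)
        lemma = lookup.get(word.lower())
        if lemma is None:
            return word  # not in the lemma file: leave unchanged
        # Keep an initial capital, so "Was" becomes "Be".
        return lemma.capitalize() if word[0].isupper() else lemma

    return re.sub(r"\w+", swap, text)


lookup = build_lookup(["BE -> AM, WAS, WERE, IS, ARE"])
print(lemmatise("she was tired", lookup))
print(lemmatise("Was she tired?", lookup))
```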