Convert Text File Format
To convert a series of whole text files from one format to another, choose between these options:
These options let you convert texts into formats suited to text processing. (UTF-8, an encoding devised some years ago for many languages when disk space was limited and character encoding was problematic, is generally not suitable. That's because it uses a variable number of bytes to represent different characters. A to Z need only 1 byte each, but accented letters need 2, most Japanese characters need 3, and some rarer characters need 4.)
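The variable byte counts mentioned above can be checked with a short Python sketch. It is only an illustration and not part of Text Converter; the sample characters and the helper name `utf8_length` are chosen here for the example.

```python
# Count how many bytes UTF-8 needs for a few sample characters.
# The samples are illustrative: a basic Latin letter, an accented letter,
# a common Japanese kanji, and a rare character outside the basic plane.
samples = ["A", "\u00e9", "\u65e5", "\U00020BB7"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s) in UTF-8")
```

The same text always decodes to the same characters; only the number of bytes per character differs.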
DOS to Windows:
... choose the "codepage" that your old DOS texts were encoded with, e.g. DOS 850 Multilingual.
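As a rough idea of what a codepage conversion involves, here is a minimal Python sketch. It assumes the DOS text is in codepage 850 and that the Windows target is codepage 1252; the function name `dos_to_windows` is hypothetical and this is not how Text Converter itself is implemented.

```python
def dos_to_windows(data: bytes, dos_codepage: str = "cp850") -> bytes:
    """Re-encode DOS bytes as Windows-1252 bytes (assumed target codepage)."""
    # Interpret the raw bytes using the DOS codepage the text was saved with.
    text = data.decode(dos_codepage)
    # DOS-only symbols such as box-drawing characters have no Windows-1252
    # equivalent, so they are replaced with "?" rather than raising an error.
    return text.encode("cp1252", errors="replace")


# In codepage 850 the byte 0x82 is "é"; in Windows-1252 "é" is 0xE9.
print(dos_to_windows(b"caf\x82"))
```

Choosing the wrong source codepage would not raise an error here; it would silently produce the wrong accented characters, which is why the choice matters.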
Unix to Windows:
... Unix-saved texts don't use the same codes for end-of-paragraph as Windows-saved ones: Unix uses a single line feed (LF), Windows a carriage return followed by a line feed (CR LF).
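The difference in end-of-paragraph codes can be sketched in Python as follows. The function name `unix_to_windows` is hypothetical, and the sketch works on raw bytes so that it does not depend on the text encoding (it assumes a single-byte or UTF-8 encoded file).

```python
def unix_to_windows(data: bytes) -> bytes:
    """Convert Unix line endings (LF) to Windows line endings (CR LF)."""
    # First normalise any CR LF already present, so it is not doubled
    # into CR CR LF, then expand every LF to CR LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


print(unix_to_windows(b"first paragraph\nsecond paragraph\n"))
```

Running the function twice gives the same result as running it once, so files that are already in Windows format are left unchanged.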
... this is a better standard than ANSI as it allows many more characters to be used, suiting many languages. This is UTF-16 Unicode, which uses 2 bytes for most characters (a few rare characters need 4).
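A minimal Python sketch of converting ANSI text to UTF-16 might look like this. It assumes the ANSI source is Windows-1252 and that the output should be little-endian with a byte order mark, as is usual on Windows; the function name `ansi_to_utf16` is hypothetical.

```python
import codecs


def ansi_to_utf16(data: bytes, ansi_codepage: str = "cp1252") -> bytes:
    """Re-encode ANSI bytes as UTF-16 little-endian with a byte order mark."""
    # A handful of byte values are undefined in Windows-1252; they are
    # replaced with U+FFFD here instead of stopping the conversion.
    text = data.decode(ansi_codepage, errors="replace")
    # The byte order mark tells other programs how to read the 2-byte units.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


converted = ansi_to_utf16(b"caf\xe9")
print(len(converted))  # 2 bytes of BOM plus 2 bytes for each of 4 characters
```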
from MS Word .doc
... like using "Save as Text" in Word.
HTML/BNC entities to characters
... converts symbols which are hard to read such as &eacute; to ones like é
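For standard HTML entities, the conversion can be sketched with Python's built-in `html.unescape`. This is an illustration only: entity names specific to the BNC that are not part of HTML would need an additional lookup table, which is not shown, and the function name `entities_to_characters` is hypothetical.

```python
import html


def entities_to_characters(text: str) -> str:
    """Replace HTML named and numeric character references with characters."""
    # html.unescape handles named references such as &eacute; as well as
    # numeric ones such as &#233; and leaves all other text untouched.
    return html.unescape(text)


print(entities_to_characters("caf&eacute; and na&#239;ve"))
```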
from column tagged, using <> except column
... The Stuttgart Tree Tagger produces output like this:
word        pos   lemma
The         DT    the
TreeTagger  NP    TreeTagger
is          VBZ   be
easy        JJ    easy
to          TO    to
use         VB    use
.           SENT  .
If you set the column to 1, Text Converter will convert this to:
The<DT><the> TreeTagger<NP><TreeTagger> is<VBZ><be> easy<JJ><easy> to<TO><to> use<VB><use> .<SENT><.>
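The column conversion shown above can be approximated with the following Python sketch. It assumes the columns are separated by whitespace (TreeTagger normally writes tabs) and that only the data rows are passed in; the function name `column_tagged_to_inline` and its `keep_column` parameter are hypothetical, not Text Converter settings.

```python
def column_tagged_to_inline(lines, keep_column=1):
    """Join column-tagged rows into one line, wrapping every column
    except `keep_column` (counted from 1) in angle brackets."""
    index = keep_column - 1
    tokens = []
    for line in lines:
        fields = line.split()
        # Skip blank rows and rows that are too short to hold the column.
        if len(fields) <= index:
            continue
        tags = "".join(f"<{field}>" for i, field in enumerate(fields) if i != index)
        tokens.append(fields[index] + tags)
    return " ".join(tokens)


rows = [
    "The DT the",
    "TreeTagger NP TreeTagger",
    "is VBZ be",
    "easy JJ easy",
    "to TO to",
    "use VB use",
    ". SENT .",
]
print(column_tagged_to_inline(rows, keep_column=1))
```

With `keep_column=1` the word stays as plain text and the part-of-speech tag and lemma become `<>` tags, matching the example output above.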
Lemmatised using ...
... converts each file using a lemma file. If your source text has "she was tired" and your lemma file has BE -> AM, WAS, WERE, IS, ARE, then you will get "she be tired" in your converted text file. Where your source text has "Was she tired?" you'll get "Be she tired?"
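The lemmatising behaviour described above can be sketched in Python as follows. The sketch assumes each lemma-file line has the form `LEMMA -> FORM, FORM, ...` and that the replacement copies the capitalisation of the original word, as the two examples suggest; the function names `parse_lemma_line` and `lemmatise` are hypothetical.

```python
import re


def parse_lemma_line(line: str) -> dict:
    """Turn 'BE -> AM, WAS, WERE' into {'am': 'be', 'was': 'be', 'were': 'be'}."""
    lemma, forms = line.split("->")
    return {form.strip().lower(): lemma.strip().lower() for form in forms.split(",")}


def lemmatise(text: str, lemma_map: dict) -> str:
    """Replace every word found in lemma_map with its lemma, keeping case."""
    def replace(match):
        word = match.group(0)
        lemma = lemma_map.get(word.lower())
        if lemma is None:
            return word  # not in the lemma file: leave the word unchanged
        if len(word) > 1 and word.isupper():
            return lemma.upper()
        if word[0].isupper():
            return lemma.capitalize()
        return lemma
    return re.sub(r"\w+", replace, text)


lemma_map = parse_lemma_line("BE -> AM, WAS, WERE, IS, ARE")
print(lemmatise("she was tired", lemma_map))
print(lemmatise("Was she tired?", lemma_map))
```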