Convert Text File Format
To convert a series of whole text files from one format to another, choose between these options:
These options convert your texts into formats suited to text processing. (UTF-8 is generally not suitable for this. It was devised for many languages some years ago, when disk space was limited and character encoding was problematic, and it uses a variable number of bytes to represent different characters. A to Z take only 1 byte each, accented letters such as é take 2, and a Japanese character typically needs 3 bytes, with some characters needing 4.)
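The difference in byte counts can be checked with a few lines of Python. This is only an illustrative sketch, not part of Text Converter; the helper name `byte_lengths` is made up for the example.

```python
# Compare how many bytes one character needs in UTF-8 and in UTF-16.
def byte_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-le" is used so that no byte-order mark is included in the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in ["A", "é", "日"]:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"{ch!r}: UTF-8 = {utf8_len} byte(s), UTF-16 = {utf16_len} byte(s)")
```

For these three characters UTF-8 needs 1, 2 and 3 bytes respectively, while UTF-16 needs 2 bytes for each.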
DOS to Windows: ... choose the "codepage" that your old DOS texts were encoded with, e.g. DOS 850 Multilingual.
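Unix to Windows: ... Unix-saved texts don't use the same codes for end-of-paragraph as Windows-saved ones. Unix uses a single line feed (LF), whereas Windows uses a carriage return followed by a line feed (CR LF).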
into Unicode: ... this is a better standard than ANSI as it allows many more characters to be used, suiting lots of languages. This is UTF-16 Unicode, which uses 2 bytes for nearly every character in common use (rarer characters take 4).
from MS Word .doc ... like using "Save as Text" in Word.
HTML/BNC entities to characters ... converts symbols which are hard to read such as &eacute; to ones like é
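For readers who want to see the idea outside Text Converter, here is a minimal Python sketch using the standard library's `html.unescape`. It is an approximation under stated assumptions: it handles standard HTML entities such as `&eacute;`, but entities specific to the BNC would need their own lookup table, which is not shown here. The function name `entities_to_characters` is hypothetical.

```python
import html

def entities_to_characters(text):
    """Replace HTML entities (named or numeric) with the characters they stand for."""
    return html.unescape(text)

print(entities_to_characters("caf&eacute; cr&egrave;me"))  # prints: café crème
```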
from column tagged, using <> except column ... The Stuttgart Tree Tagger produces output like this:
word        pos   lemma
The         DT    the
TreeTagger  NP    TreeTagger
is          VBZ   be
easy        JJ    easy
to          TO    to
use         VB    use
.           SENT  .
If you set the column to 1, Text Converter will convert this to
The<DT><the> TreeTagger<NP><TreeTagger> is<VBZ><be> easy<JJ><easy> to<TO><to> use<VB><use> .<SENT><.>
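The conversion just described can be sketched in Python as follows. This is not Text Converter's own code. It assumes the tagger output is tab-separated with one token per line, that the columns stay in their original order, and that only the chosen column is left without angle brackets. The name `convert_column_tagged` and its `plain_column` parameter are invented for the illustration.

```python
def convert_column_tagged(lines, plain_column=1):
    """Join column-tagged lines into running text.

    Every column is wrapped in <> except the one numbered plain_column
    (counting from 1), which is left as plain text.
    """
    tokens = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        parts = [
            field if index == plain_column else f"<{field}>"
            for index, field in enumerate(fields, start=1)
        ]
        tokens.append("".join(parts))
    return " ".join(tokens)

tagger_output = [
    "The\tDT\tthe",
    "TreeTagger\tNP\tTreeTagger",
    "is\tVBZ\tbe",
    "easy\tJJ\teasy",
    "to\tTO\tto",
    "use\tVB\tuse",
    ".\tSENT\t.",
]
print(convert_column_tagged(tagger_output, plain_column=1))
```

With the column set to 1, this prints the same line as the converted example above.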
Lemmatised using ... ... converts each file using a lemma file. If your source text has "she was tired" and your lemma file has BE -> AM, WAS, WERE, IS, ARE, then you will get "she be tired" in your converted text file. Where your source text has "Was she tired?" you'll get "Be she tired?"
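The lemmatising behaviour in this example can be approximated in Python. The sketch below is not the program's actual implementation. It assumes each line of the lemma file has the form LEMMA -> FORM, FORM, ... and that the capitalisation of the original word is carried over to the lemma, as the "Was" to "Be" example suggests. The function names are hypothetical.

```python
import re

def load_lemma_map(lemma_lines):
    """Build a lookup from each word form (lower case) to its lemma (lower case)."""
    form_to_lemma = {}
    for line in lemma_lines:
        if "->" not in line:
            continue  # ignore lines that are not lemma entries
        lemma, forms = line.split("->", 1)
        for form in forms.split(","):
            form = form.strip()
            if form:
                form_to_lemma[form.lower()] = lemma.strip().lower()
    return form_to_lemma

def lemmatise(text, form_to_lemma):
    """Replace each known word form with its lemma, keeping the original casing."""
    def replace(match):
        word = match.group(0)
        lemma = form_to_lemma.get(word.lower())
        if lemma is None:
            return word  # not in the lemma file: leave unchanged
        if word.isupper() and len(word) > 1:
            return lemma.upper()
        if word[0].isupper():
            return lemma.capitalize()
        return lemma
    # [^\W\d_]+ matches runs of letters, including accented ones.
    return re.sub(r"[^\W\d_]+", replace, text)

lemma_map = load_lemma_map(["BE -> AM, WAS, WERE, IS, ARE"])
print(lemmatise("she was tired", lemma_map))   # prints: she be tired
print(lemmatise("Was she tired?", lemma_map))  # prints: Be she tired?
```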