A corpus like the BNC may be usefully converted in three or four ways:
1. in a format which Windows will expect, preferably with a .txt filename so that Windows will open each text easily;
2. with the files all stored in folders whose names mean something useful;
3. in Unicode, a format which handles all the curly quote marks and dashes unambiguously;
4. optionally, you may also want a markup-free copy so you can read the texts easily.
In WordSmith, use Text Converter for this.
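For readers who want to see what these conversions amount to outside WordSmith, here is a minimal Python sketch. It is not how Text Converter works internally. It assumes the source files are UTF-8 XML and that "Unicode" means UTF-16 on output; both encodings are parameters you can change. The names `strip_markup` and `convert_file` are made up for this illustration, and the tag-stripping regular expression is a crude stand-in for a proper XML parser.

```python
import html
import re
from pathlib import Path

# Anything between angle brackets is treated as markup (a rough assumption).
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text: str) -> str:
    """Remove tag-like spans, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub("", text))


def convert_file(src: Path, dest_dir: Path, src_encoding: str = "utf-8",
                 dest_encoding: str = "utf-16", plain: bool = False) -> Path:
    """Write a .txt copy of src into dest_dir in the target encoding.

    src_encoding and dest_encoding are assumptions; adjust them to your data.
    With plain=True the copy is also stripped of markup.
    """
    text = src.read_text(encoding=src_encoding)
    if plain:
        text = strip_markup(text)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (src.stem + ".txt")  # .txt so Windows opens it easily
    dest.write_text(text, encoding=dest_encoding)
    return dest
```

Calling `convert_file(path, out_dir)` for every file gives a Unicode copy with markup intact; calling it again with `plain=True` and a different output folder gives the optional markup-free copy.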
I selected these texts (the XML edition has 3 main folders; the ones needed for WordSmith are in the \texts folder), then converted them into Unicode, dealing with curly quotes etc., with the options checked as below:
The above screenshot was taken as the processing was being done; it took about an hour as there are many thousands of text files. Then I filtered them according to Dave Lee's codes so as to get them into folders that mean something to me!
That took another hour, working across a home network.
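The sorting step can be pictured with a short Python sketch. This is only an illustration and not the filter WordSmith applies. It assumes you already have a lookup table from text ID (the file name without its extension) to a category label; the function name `sort_into_folders`, the `genre_of` table and the "unclassified" fallback folder are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Dict


def sort_into_folders(text_dir: Path, out_dir: Path,
                      genre_of: Dict[str, str]) -> Dict[str, int]:
    """Copy each .txt file under text_dir into out_dir/<label>/.

    genre_of maps a text ID to a folder label (hypothetical table, to be
    built from whatever classification you use). Files with no entry go
    into an "unclassified" folder. Keep out_dir outside text_dir so that
    the copies are not picked up again by the search.
    Returns the number of files copied per label.
    """
    counts: Dict[str, int] = {}
    for path in sorted(text_dir.rglob("*.txt")):
        label = genre_of.get(path.stem, "unclassified")
        target = out_dir / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target / path.name)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Copying rather than moving leaves the original folder structure untouched, at the cost of disk space; across a network the copying itself is likely to be the slow part.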