A corpus like the BNC may be usefully converted in three or four ways:
1. into a format which Windows expects, preferably with a .txt filename extension so that Windows will open each text easily
2. with the files all stored in folders whose names mean something useful
3. into Unicode, a format which handles all the curly quote marks and dashes unambiguously
4. optionally, into a markup-free copy so you can read the texts easily.
In WordSmith, use Text Converter for this.
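If you want to see what these conversions amount to outside WordSmith, here is a minimal Python sketch. It is not how Text Converter itself works. The function name `convert_corpus`, the assumption that the source files are UTF-8 files ending in .xml, and the choice of UTF-16 as the Unicode output are all mine; the tag-stripping regular expression is crude and simply deletes anything between angle brackets.

```python
import re
from pathlib import Path

# Crude markup remover: deletes anything between < and >.
TAG = re.compile(r"<[^>]+>")

def convert_corpus(src_dir, dst_dir, strip_markup=False, src_encoding="utf-8"):
    """Copy every .xml file under src_dir to dst_dir as a Unicode .txt file.

    Assumptions (not from the original text): sources are UTF-8 .xml files,
    and UTF-16 is the wanted Unicode output. The folder layout is preserved.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    written = []
    for src in sorted(src_dir.rglob("*.xml")):
        text = src.read_text(encoding=src_encoding)
        if strip_markup:
            text = TAG.sub("", text)  # optional markup-free copy
        # Same relative path, but with a .txt extension.
        dst = (dst_dir / src.relative_to(src_dir)).with_suffix(".txt")
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(text, encoding="utf-16")
        written.append(dst)
    return written
```

Calling `convert_corpus("texts", "bnc_txt")` would keep the markup; adding `strip_markup=True` and a different destination folder would give the optional readable copy. Both folder names here are only examples.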
I selected these texts (the XML edition has 3 main folders; the ones needed for WordSmith are in the \texts folder),
then converted them into Unicode, dealing with curly quotes etc., with the options checked as below:
The above screenshot was taken as the processing was being done; it took about an hour as there are many thousands of text files. Then I filtered them according to Dave Lee's codes so as to get them into folders that mean something to me!
That took another hour, working across a home network.
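The sorting step can be pictured with another small Python sketch. Again this is an illustration under assumptions, not WordSmith's own filtering: the functions `folder_name` and `sort_by_code` are hypothetical, and so is the `codes` dictionary, which maps a text's file name (without extension) to its classification code. How you obtain that mapping is left open here.

```python
import re
import shutil
from pathlib import Path

def folder_name(code):
    """Turn a classification code into a name that is safe for a folder."""
    return re.sub(r"[^A-Za-z0-9]+", "_", code.strip()).strip("_")

def sort_by_code(files, codes, dst_dir):
    """Copy each file into dst_dir/<code>/.

    `codes` is a hypothetical dict such as {"A00": "some code"}; files with
    no entry go into a folder called 'unclassified'.
    """
    dst_dir = Path(dst_dir)
    placed = {}
    for f in map(Path, files):
        code = codes.get(f.stem)
        sub = folder_name(code) if code else "unclassified"
        target = dst_dir / sub
        target.mkdir(parents=True, exist_ok=True)
        placed[f.name] = shutil.copy2(f, target / f.name)
    return placed
```

Copying rather than moving leaves the converted files where they were, so a mistake in the mapping can be put right by simply running the sort again.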