Alternative formats for the BNC

A corpus like the BNC may be usefully converted in three or four ways:


1. in a format which Windows will expect, preferably with a .txt filename extension so that Windows will open each text easily
2. with the files all stored in folders whose names mean something useful
3. in Unicode, a format which handles all the curly quote marks and dashes unambiguously
4. optionally, you may also want a markup-free copy so that you can read the texts easily.
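For readers who want to see what such a conversion involves outside WordSmith, here is a minimal Python sketch. It is not what Text Converter does internally. It assumes the source files are UTF-8 XML, treats "Unicode" as UTF-16 (an assumption you can change through `out_encoding`), and the names `strip_markup` and `convert_file` are hypothetical.

```python
import html
import re
from pathlib import Path

TAG = re.compile(r"<[^>]+>")  # any XML/SGML tag


def strip_markup(xml_text: str) -> str:
    """Remove tags, decode character entities and tidy the whitespace."""
    text = TAG.sub("", xml_text)
    text = html.unescape(text)  # e.g. &eacute; becomes é, &rdquo; becomes ”
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def convert_file(src: Path, dest_dir: Path,
                 out_encoding: str = "utf-16",
                 keep_markup: bool = True) -> Path:
    """Write a .txt copy of src into dest_dir in the chosen encoding."""
    text = src.read_text(encoding="utf-8")  # assumption: the source is UTF-8
    if not keep_markup:
        text = strip_markup(text)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (src.stem + ".txt")
    dest.write_text(text, encoding=out_encoding)
    return dest
```

Note that this simple tag-stripping leaves the words of the header in the output, so a truly readable markup-free copy would need the header removed first.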


In WordSmith, use Text Converter for this.


I selected these texts (the XML edition has 3 main folders; the ones needed for WordSmith are in the \texts folder)



then converted them into Unicode, dealing with curly quotes etc., with the options checked as below:


The above screenshot was taken as the processing was being done; it took about an hour as there are many thousands of text files. Then I filtered them according to Dave Lee's codes so as to get them into folders that mean something to me!



That took another hour, working across a home network.
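The filtering step can be sketched in Python too, again only as an illustration and not as WordSmith's own method. The sketch assumes that each text's header carries the genre code in an element of the form `<classCode scheme="DLEE">...</classCode>`; check this against your own copy of the corpus. The function names are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumption: the genre code sits in this element in each text's header.
CLASS_CODE = re.compile(r'<classCode\s+scheme="DLEE">\s*([^<]+?)\s*</classCode>')


def genre_folder(header_text: str, default: str = "unclassified") -> str:
    """Turn the genre code found in header_text into a safe folder name."""
    match = CLASS_CODE.search(header_text)
    if not match:
        return default
    # Replace spaces and characters Windows forbids in folder names.
    return re.sub(r'[\\/:*?"<>|\s]+', "_", match.group(1)).strip("_")


def sort_by_genre(src_dir: Path, dest_root: Path, pattern: str = "*.xml") -> None:
    """Copy every matching file into a folder named after its genre code."""
    for path in src_dir.rglob(pattern):
        with path.open(encoding="utf-8") as handle:
            head = handle.read(20000)  # the header is near the top of the file
        target = dest_root / genre_folder(head)
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target / path.name)
```

Copying rather than moving keeps the original \texts folder intact, at the cost of extra disk space.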




