Alternative formats for the BNC


A corpus like the BNC may be usefully converted in three or four ways:

 

1. into a format which Windows will expect, preferably with a .txt filename extension so that Windows will open each text easily
2. with the files all stored in folders whose names mean something useful
3. into Unicode, a format which handles all the curly quote marks and dashes unambiguously
4. optionally, into a markup-free copy so that you can read the texts easily.
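For readers who want to see what points 1, 3 and 4 amount to outside WordSmith, here is a minimal Python sketch. It is not how Text Converter works internally; the function names (strip_markup, convert_file) are hypothetical, and it assumes the source files are UTF-8 XML and that UTF-16 is an acceptable Unicode output. It removes every tag, so header text is kept along with the body.

```python
import html
import re
from pathlib import Path

# Anything between "<" and ">" is treated as a tag (a simplification).
TAG = re.compile(r"<[^>]+>")


def strip_markup(xml_text: str) -> str:
    """Drop tags, decode entities such as &amp;, and tidy whitespace."""
    no_tags = TAG.sub("", xml_text)
    decoded = html.unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()


def convert_file(src: Path, dst_dir: Path, encoding_out: str = "utf-16") -> Path:
    """Write a markup-free Unicode copy of src into dst_dir as a .txt file."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / (src.stem + ".txt")
    text = src.read_text(encoding="utf-8")  # assumption: input is UTF-8
    dst.write_text(strip_markup(text), encoding=encoding_out)
    return dst
```

Calling convert_file(Path("A00.xml"), Path("plain")) would produce plain/A00.txt; the file and folder names here are only illustrative.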

 

In WordSmith, use Text Converter for this.

 

I selected these texts (the XML edition has 3 main folders; the ones needed for WordSmith are in the \texts folder):

 

[Screenshot: choosing the BNC XML texts in Text Converter]

then converted them into Unicode, dealing with curly quotes etc., with the options checked as below:

[Screenshot: Text Converter converting the BNC XML texts]

The above screenshot was taken while the processing was under way; it took about an hour, as there are many thousands of text files. Then I filtered them according to Dave Lee's codes, so as to get them into folders that mean something to me!

 

[Screenshot: Text Converter filtering all the BNC XML texts by class code]

That took another hour, working across a home network.
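The filtering step can be pictured with a short Python sketch. This is an illustration only, not Text Converter's own method: the function names are hypothetical, and it assumes each file's header carries its classification in a classCode element, which should be checked against your copy of the corpus. Files with no such element go into a folder called "unclassified".

```python
import re
import shutil
from pathlib import Path

# Assumption: the code sits in an element like <classCode ...>W ac:medicine</classCode>
CLASSCODE = re.compile(r"<classCode[^>]*>([^<]+)</classCode>")


def folder_name(code: str) -> str:
    """Turn a code such as 'W ac:medicine' into a safe folder name."""
    return re.sub(r"[^A-Za-z0-9]+", "_", code.strip()).strip("_")


def class_code(xml_text: str, default: str = "unclassified") -> str:
    """Return the first class code found in the text, or the default."""
    match = CLASSCODE.search(xml_text)
    return match.group(1).strip() if match else default


def sort_by_code(src_dir: Path, dst_dir: Path) -> dict:
    """Copy every .xml file under src_dir into dst_dir/<code folder>."""
    counts = {}
    for src in sorted(src_dir.rglob("*.xml")):
        code = class_code(src.read_text(encoding="utf-8"))
        target = dst_dir / folder_name(code)
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target / src.name)  # copy, leaving the original in place
        counts[code] = counts.get(code, 0) + 1
    return counts
```

The sketch copies rather than moves, so the original \texts folder is left untouched and the sort can be repeated safely.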

 

 

 

Page url: http://www.lexically.net/wordsmith/Handling_BNC/index.html?alternative_formats_for_bnc.htm