The British National Corpus is a valuable resource, but it has certain problems as it comes straight off the CD-ROM:
• it is in Unix format
• it has entities like &amp;eacute; to represent characters like é
• its structure is opaque and the file names mean nothing
You will find it much easier to use if you
• convert it to Unicode
• filter the files to make a useful structure
as explained at https://lexically.net/wordsmith/Handling_BNC/index.html
The easiest way to do that is in three stages.
Conversion:
After choosing the texts,
After that, select the files you have just converted to Windows format (here at J:\temp\BNC_XML_1) and do another conversion:
(You will find the BNC XML categories file in your Documents\wsmith8 folder.) When you press OK you'll be asked something like this:
After the work is done you will see the BNC texts copied to a similar structure (in our case stemming from J:\temp).
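For readers who want a feel for what the conversion stage does to each text, here is a minimal Python sketch of the same two ideas: turning Unix line endings into Windows ones and replacing named entities with the Unicode characters they stand for. It is an illustration only, not WordSmith's own routine. The function names and the default encodings are assumptions made for this example, and the entity table is Python's standard HTML one, which may not cover every entity used in the BNC.

```python
import re
from html.entities import name2codepoint

# Entities that must stay escaped so the markup remains well-formed.
XML_RESERVED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text: str) -> str:
    """Replace named entities such as &eacute; with the character itself."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_RESERVED or name not in name2codepoint:
            return match.group(0)  # leave reserved or unknown entities alone
        return chr(name2codepoint[name])
    return _ENTITY.sub(replace, text)


def to_windows_newlines(text: str) -> str:
    """Convert any line endings to CR+LF without doubling existing ones."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", "\r\n")


def convert_file(src: str, dst: str,
                 src_encoding: str = "latin-1",   # assumed input encoding
                 dst_encoding: str = "utf-8") -> None:  # assumed output encoding
    """Read one text, convert it, and write the result to dst."""
    with open(src, "r", encoding=src_encoding, newline="") as handle:
        text = handle.read()
    converted = to_windows_newlines(decode_entities(text))
    # newline="" stops Python from translating the line endings a second time
    with open(dst, "w", encoding=dst_encoding, newline="") as handle:
        handle.write(converted)
```

Reserved entities such as &amp;amp; and &amp;lt; are deliberately left untouched, since decoding them would damage the XML mark-up.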
Filter
Choose the converted texts in the first window:
de-activate conversion,
and choose filtering like this:
Eventually you should get folder structures like this:
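In effect, the filtering stage copies each text into a folder named for its category. As a rough sketch of that idea outside WordSmith, the Python below copies files into category folders. The `categories` mapping (text ID to folder name) and the function name are hypothetical; the real BNC XML categories file has its own format, which this example does not attempt to read.

```python
import shutil
from pathlib import Path


def sort_into_folders(src_dir, dst_dir, categories):
    """Copy each .xml text under src_dir into dst_dir/<category>/.

    categories is a hypothetical dict mapping a text ID (the file name
    without its extension) to a folder name. Texts that are not listed
    in the mapping are skipped.
    """
    copied = []
    for path in sorted(Path(src_dir).rglob("*.xml")):
        category = categories.get(path.stem)
        if category is None:
            continue
        target_dir = Path(dst_dir) / category
        target_dir.mkdir(parents=True, exist_ok=True)
        # copy2 keeps the file's timestamps along with its contents
        copied.append(Path(shutil.copy2(path, target_dir / path.name)))
    return copied
```

A call such as `sort_into_folders(r"J:\temp\BNC_XML_1", r"J:\temp\BNC_sorted", {"A00": "written"})` would copy only the text A00 into a `written` folder; the destination folder name and the category label here are invented for illustration.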
See also: XML simplification