WordSmith Tools Manual

Corpus Checker

relevance check

The point of it...

The aim is to filter your corpus by checking which of certain words or phrases are found or not found in each text. It operates a scoring system. You specify words or phrases which you believe are typical of the field you're investigating and can specify some which you see as unwanted distractors. Text files which score highly can then be copied or moved to a location of your choice.






For example, I was studying austerity in news text. Lots of articles mentioned austerity, sometimes only incidentally. And I wanted to study austerity in Britain, but a lot of the articles concerned Greece. So my positive filters had terms like cost-cutting, UK etc. and my negative filters included Greek, Greece etc. To get a suitable corpus I wanted texts with quite a lot of the positive terms and few of the negative ones. After the relevance check was done I was able to filter out most of the texts, leaving only ones which were much more relevant to my enquiry.


The texts listed are the ones which scored above the Min. score setting. Press Filter relevant texts, and those scoring highly enough get copied to a sub-folder called "filtered".
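If you want to see the idea behind this step outside WordSmith, here is a minimal Python sketch. It is not WordSmith's own code. The function name copy_relevant, the scores dictionary and the choice to keep texts whose score is at least the minimum (rather than strictly above it) are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def copy_relevant(scores, min_score, subfolder="filtered"):
    """Copy each text whose score reaches min_score into a sub-folder.

    scores maps a file path to its relevance score (hypothetical input).
    Assumption: a score equal to min_score counts as high enough.
    """
    copied = []
    for path, score in scores.items():
        path = Path(path)
        if score >= min_score:
            # The sub-folder is created next to the original text file.
            target_dir = path.parent / subfolder
            target_dir.mkdir(exist_ok=True)
            target = target_dir / path.name
            shutil.copy2(path, target)  # copy, leaving the original in place
            copied.append(target)
    return copied
```

Copying rather than moving keeps the original corpus intact, so you can change the minimum score and try again.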




The procedure uses search-terms. It doesn't actually understand the text. All it can do is give a higher score to the presence of positive terms and reduce the score if negative ones are found. Texts about the environment don't necessarily contain the word environment!
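The kind of scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it adds 1 for each positive term present and subtracts 1 for each negative term present, ignoring case and matching whole words or phrases. The weights WordSmith really uses are not given here, and the name relevance_score is made up for the example.

```python
import re


def relevance_score(text, positive_terms, negative_terms):
    """Assumed scoring: +1 per positive term present, -1 per negative term present."""
    lowered = text.lower()

    def present(term):
        # Match the term as a whole word or phrase, ignoring case.
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        return re.search(pattern, lowered) is not None

    score = sum(1 for term in positive_terms if present(term))
    score -= sum(1 for term in negative_terms if present(term))
    return score


positive = ["austerity", "cost-cutting", "UK"]
negative = ["Greek", "Greece"]

uk_text = "Austerity and cost-cutting dominated the UK budget debate."
greek_text = "Austerity measures in Greece angered Greek voters."

print(relevance_score(uk_text, positive, negative))     # 3
print(relevance_score(greek_text, positive, negative))  # -1
```

Both texts mention austerity, but only the first one scores well. A text about rising sea levels would score 0 against the single term environment, which is the limitation noted above.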


Choosing your relevance filters

A useful idea is to compute the key words and key clusters of your imperfect corpus first. That will help you find words and phrases that characterise your corpus. Use some of them, plus any others you think are plausible. Then try the relevance filter. Separate out, say, 100 texts and read (some of) them carefully to see how well you're doing. Edit the relevance filters to refine them.
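One way to separate out a batch of texts for reading is to pick them at random, so that your check isn't biased towards the first files in the folder. The Python sketch below is a suggestion only. It assumes the filtered texts are .txt files in a single folder, and the name sample_for_review is hypothetical.

```python
import random
from pathlib import Path


def sample_for_review(folder, n=100, seed=None):
    """Return up to n randomly chosen .txt files from folder for manual reading."""
    files = sorted(Path(folder).glob("*.txt"))
    rng = random.Random(seed)  # pass a seed to get the same sample again
    # If there are fewer than n texts, all of them are returned.
    return rng.sample(files, min(n, len(files)))
```

If many of the sampled texts turn out to be off-topic, adjust the positive and negative terms and run the relevance check again.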