
WordSmith Tools Manual


relevance check


The point of it...

The aim is to filter your corpus by checking which of certain words or phrases are found or not found in each text. It operates a scoring system. You specify words or phrases which you believe are typical of the field you're investigating and can specify some which you see as unwanted distractors. Text files which score highly can then be copied or moved to a location of your choice.



Choose a list of filter strings, a minimum word count and a preferred minimum score.


Filter strings syntax

Each filter string must end with =. Any that starts with ~ counts negatively.

Each filter string scores 1 point when it is found, but you can increase that by adding a value after the =. In these filter strings:

cutting spending=

spending cuts=2

spending review=




spending cuts counts twice as much as the others.
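
To make the scoring rule concrete, here is a minimal Python sketch of how such filter strings could be parsed and applied. It is not WordSmith's own code: the function names parse_filter and score_text are hypothetical, and it assumes that matching is case-insensitive, that a filter scores once per text however often it occurs, and that a ~ filter subtracts its value.

```python
def parse_filter(line):
    """Parse one filter string such as 'spending cuts=2' or '~Greece='.

    Returns (term, weight). The weight is negative for ~ filters.
    """
    term, sep, value = line.strip().rpartition("=")
    if not sep or not term:
        raise ValueError(f"filter string must end with '=' or '=N': {line!r}")
    weight = int(value) if value else 1  # no value after '=' means 1 point
    if term.startswith("~"):
        return term[1:].strip(), -weight  # assumption: negatives subtract
    return term.strip(), weight


def score_text(text, filter_lines):
    """Add up the weights of every filter string found in the text.

    Assumptions: case-insensitive substring match, and each filter
    counts once per text regardless of how many times it occurs.
    """
    lowered = text.lower()
    total = 0
    for line in filter_lines:
        term, weight = parse_filter(line)
        if term.lower() in lowered:
            total += weight
    return total


filters = ["cutting spending=", "spending cuts=2", "spending review=", "~Greece="]
print(score_text("The spending cuts followed a spending review.", filters))
```

With these assumptions, a text containing both spending cuts and spending review scores 2 + 1 = 3, and a mention of Greece would reduce the score by 1.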


Which texts to consider?

When you press the check button you will get a choice between all the text files in a folder and sub-folders, or a previously-made list, such as that created in the Corpus Sampler procedure.







I was studying austerity in news text. Lots of articles mentioned austerity, sometimes incidentally. And I wanted to study austerity in Britain but a lot of articles concerned Greece. So my filters had terms like cost-cutting, UK etc. and my negative filters included Greek, Greece etc. To get a suitable corpus I wanted quite a lot of the positive terms I preferred and few of the negative ones. After the relevance check was done I was able to filter out most of the texts leaving only ones which were much more relevant to my enquiry.


The ones visible scored above the Min. score setting. Press Filter relevant texts, and those scoring highly enough get copied to a sub-folder called "filtered".
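
The filtering step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's actual behaviour: filter_relevant is a hypothetical name, score is any function you supply that returns a number for a text, only .txt files are considered, "scoring highly enough" is taken to mean at least the minimum score, and files are copied flat into "filtered" (so two files with the same name in different sub-folders would overwrite each other).

```python
import shutil
from pathlib import Path


def filter_relevant(folder, score, min_score, min_words=0):
    """Copy texts that score at least min_score into folder/filtered.

    score is a caller-supplied function taking the text and returning
    a number. Returns the names of the files that were copied.
    """
    folder = Path(folder)
    target = folder / "filtered"
    copied = []
    # sorted() builds the full list first, so files copied during the
    # loop are never picked up by the same run
    for path in sorted(folder.rglob("*.txt")):
        if target in path.parents:
            continue  # skip files copied on an earlier run
        text = path.read_text(encoding="utf-8", errors="replace")
        if len(text.split()) < min_words:
            continue  # too short to consider (minimum word count)
        if score(text) >= min_score:
            target.mkdir(exist_ok=True)
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

Copying rather than moving leaves the original corpus untouched, so you can change the filters and run the check again.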



See the text

To see one of the texts simply select it and right-click. Choose Show this text. When it appears you will again be able to right-click in order to save it as RTF, grey out any < > sections, highlight the positive filters, etc.





The procedure uses search-terms. It doesn't actually understand the text. All it can do is give a higher score to the presence of positive terms and reduce the score if negative ones are found. Texts about the environment don't necessarily contain the word environment!  


Choosing your relevance filters

A useful idea is to compute the key words and key clusters of your imperfect corpus first. That will help you find words and phrases that characterise your corpus. Use some of them plus any others you think will be plausible. Also, read a sample of texts carefully to check what sort of corpus you have really got. To get the sample, use the corpus sampler.


Finally try the relevance filter. You can use the corpus sampler again on the filtered texts, reading (some of) them carefully to see how well you're doing. Edit the relevance filters to refine them.


See also: corpus sampler, which helps you separate the desired sample out.