WordSmith Tools Manual

Navigation: Utility Programs > Corpus Checker

relevance check

The point of it...

The aim is to filter your corpus by checking which of certain words or phrases are found or not found in each text. It operates a scoring system. You specify words or phrases which you believe are typical of the field you're investigating and can specify some which you see as unwanted distractors. Text files which score highly can then be copied or moved to a location of your choice.

Settings

Choose a list of filter strings, a minimum word count and a preferred minimum score.

relevance_checking_start

coverage: you can choose a number of equal-size segments of each text which any hits must be found in.Here, the settings mean the texts will be segmented into 5 sections (the first 20%, the second 20% and so on) and any three of them must have at least one hit.

Filter strings syntax

Which texts to consider?

When you press the check_button button you will get a choice between all the text files in a folder and sub-folders, or a previously-made list, such as that created in the Corpus Sampler procedure.

Folder_or_list_choosing

repeated chunks

Display

relevance_checking

The display shows scores, number of words and number of different hit-types found. In the status bar at the bottom you see that this search filtered out just over half of the 35,676 text files. The score is just a number. It will vary according to how much value you attach to a search-term, the number of search terms you seek. In the search above the values averaged about 20; there were 91 search terms.

Example

I was studying austerity in news text. Lots of articles mentioned austerity, sometimes incidentally. And I wanted to study austerity in Britain but a lot of articles concerned Greece. So my filters had terms like cost-cutting, UK etc. and my negative filters included Greek, Greece etc. To get a suitable corpus I wanted quite a lot of the positive terms I preferred and few of the negative ones. After the relevance check was done I was able to filter out most of the texts leaving only ones which were much more relevant to my enquiry.

Filter relevant texts button

RTF Sample button

See the text

To see one of the texts simply select it and double-click (or right-click and choose Show this text). When it appears you will again be able to right-click in order to save it as RTF, grey out any < > sections, highlight the positive filters, etc. After highlighting all, we get

relevance_showing_text

You see the search terms coloured, and to the right a dispersion plot showing where they appear in the text.

Limitation

The procedure uses search-terms. It doesn't actually understand the text. All it can do is give a higher score to the presence of positive terms and reduce the score if negative ones are found. Texts about the environment don't necessarily contain the word environment!

Choosing your relevance filters

A useful idea is to compute the key words and key clusters of your imperfect corpus first. That will help you find words and phrases that characterise your corpus. Use some of them plus any others you think will be plausible. Also, read a sample of texts carefully to check what sort of corpus you really got.

Finally try the relevance filter. You can sample the texts to see how well you're doing. Edit the relevance filters to refine them.

See also: RTF sample, corpus sampler, which helps you separate the desired sample out.