How Key Words are calculated


The "key words" are calculated by comparing the frequency of each word in the word-list of the text you're interested in with the frequency of the same word in the reference word-list. All words which appear in the smaller list are considered, unless they are in a stop list.


If the word "the" occurs, say, 5% of the time in the small word-list and 6% of the time in the reference corpus, it will not turn out to be "key", though it may well be the most frequent word. If the text concerns the anatomy of spiders, the names of the researchers and items such as spider, leg and eight may well be more frequent than they are in your reference corpus (unless your reference corpus only concerns spiders!).


To compute the "key-ness" of an item, the program therefore computes

its frequency in the small word-list

the number of running words in the small word-list

its frequency in the reference corpus

the number of running words in the reference corpus

and cross-tabulates these.
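The cross-tabulation can be pictured as a 2 x 2 table. The following is a minimal sketch in Python, not the program's own code; the function name keyness_table and the sample counts are illustrative assumptions, with the counts chosen to match the 5% and 6% example above.

```python
def keyness_table(freq_small, size_small, freq_ref, size_ref):
    """Return a 2 x 2 table for one word (hypothetical helper).

    Rows: the word itself, then all other running words.
    Columns: the small word-list, then the reference corpus.
    """
    return [
        [freq_small, freq_ref],
        [size_small - freq_small, size_ref - freq_ref],
    ]


# A word occurring 50 times in 1,000 running words (5%)
# and 600 times in 10,000 running words (6%).
table = keyness_table(50, 1000, 600, 10000)
```

Each statistic described below is then computed from these four counts.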


Two statistical tests are computed:

Ted Dunning's Log Likelihood test, which measures keyness in terms of statistical significance and is considered more appropriate than chi-square, especially when contrasting long texts or a whole genre against your reference corpus.

Log ratio: Andrew Hardie's procedure, which emphasizes the size of the keyness as opposed to its statistical significance. It is related to the %DIFF procedure from Costas Gabrielatos & Anna Marchi, but produces smaller numbers which are easier to understand. A value of 2 means the item is 4 times more frequent in the small word-list than in the reference corpus list. A value of 3 means it's 8 times more frequent, and a value of 4 means it's 16 times more frequent.
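As a rough illustration of the two measures, here is a sketch in Python. It assumes the two-term form of the log likelihood formula given on UCREL's site and takes log ratio to be the base-2 logarithm of the ratio of relative frequencies; whether the program uses exactly these forms is an assumption, and the function names are invented for the example.

```python
import math


def log_likelihood(freq_small, size_small, freq_ref, size_ref):
    """Two-term log likelihood (assumed form, as on UCREL's site)."""
    total = size_small + size_ref
    # Expected frequencies if the word were equally common in both lists.
    e_small = size_small * (freq_small + freq_ref) / total
    e_ref = size_ref * (freq_small + freq_ref) / total
    ll = 0.0
    if freq_small > 0:
        ll += freq_small * math.log(freq_small / e_small)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / e_ref)
    return 2 * ll


def log_ratio(freq_small, size_small, freq_ref, size_ref):
    """Binary log of the ratio of relative frequencies.

    Both frequencies must be greater than zero in this sketch.
    """
    return math.log2((freq_small / size_small) / (freq_ref / size_ref))


# 4% of the small word-list against 1% of the reference corpus:
# the item is 4 times more frequent, so the log ratio is 2.
lr = log_ratio(40, 1000, 100, 10000)
ll = log_likelihood(40, 1000, 100, 10000)
```

A word with the same relative frequency in both lists gets a log ratio of 0 and a log likelihood of 0.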


See UCREL's log likelihood site for more on these.


A word will get into the listing here if it is unusually frequent (or unusually infrequent) in comparison with what one would expect on the basis of the larger word-list.


Unusually infrequent key-words are called "negative key-words" and appear at the very end of your listing, in a different colour. Note that negative key-words will be omitted automatically from a keywords database and a plot.
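The placing of negative key-words at the end of the listing can be sketched as follows. This is only an illustration in Python: the function name order_listing, the sample words and their log ratio values are all invented, and the sketch assumes a negative log ratio is what marks a key-word as negative.

```python
def order_listing(entries):
    """Put negative key-words after the positive ones (hypothetical helper).

    entries is a list of (word, log_ratio) pairs. A negative log ratio
    means the word is less frequent than the reference list would suggest.
    """
    positive = [e for e in entries if e[1] >= 0]
    negative = [e for e in entries if e[1] < 0]
    return positive + negative


# Invented values, for illustration only.
listing = order_listing(
    [("she", -1.5), ("spider", 5.1), ("of", -1.2), ("leg", 3.4)]
)
words = [word for word, _ in listing]
```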


Words which do not occur at all in the reference corpus are treated as if they occurred 5.0e-324 times (a decimal point followed by more than three hundred zeroes before the 5). This number is so small that it does not affect the calculation materially, while avoiding the errors that a true zero would cause in the computer's processor.
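The effect of this substitution can be checked with a short Python sketch. It assumes the two-term log likelihood form given on UCREL's site; the function name and the way the tiny value is applied are illustrative assumptions, not the program's actual code.

```python
import math

# Smallest positive value a 64-bit floating point number can hold.
TINY = 5e-324


def log_likelihood_floored(freq_small, size_small, freq_ref, size_ref):
    """Log likelihood with a zero reference frequency replaced by TINY.

    Assumes freq_small is greater than zero, since only words in the
    small word-list are considered.
    """
    a = freq_small
    b = freq_ref if freq_ref > 0 else TINY
    total = size_small + size_ref
    e_small = size_small * (a + b) / total
    e_ref = size_ref * (a + b) / total
    # Subtract logarithms instead of dividing, so that the tiny value
    # cannot underflow to zero before the logarithm is taken.
    term_small = a * (math.log(a) - math.log(e_small))
    term_ref = b * (math.log(b) - math.log(e_ref))
    return 2 * (term_small + term_ref)


# A word found 20 times in 1,000 running words and never in
# a reference corpus of 10,000 running words.
absent = log_likelihood_floored(20, 1000, 0, 10000)
```

In this sketch the term for the reference corpus is vanishingly small, so the result is the same, to the precision of the machine, as if that term had been left out altogether.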

