﻿ KeyWords: calculation

# WordSmith Tools Help

The "key words" are calculated by comparing the frequency of each word in the word-list of the text you're interested in (study corpus) with the frequency of the same word in another word-list (comparison corpus). All words which appear in the smaller list are considered, unless they are in a stop list.

If the occurs say, 5% of the time in the study corpus and 6% of the time in the comparison corpus, it will not turn out to be "key", though it may well be the most frequent word. If the text concerns the anatomy of spiders, it may well turn out that the names of the researchers, and the items spider, leg, eight, etc. may be more frequent than they would otherwise be in your comparison corpus (unless your comparisonr corpus only concerns spiders!)

To compute the "keyness" of an item, the program therefore computes

its frequency in the small word-list

the number of running words in the small word-list

its frequency in the other corpus

the number of running words in the comparison corpus

and cross-tabulates these.

A word will get into the listing here if it is unusually frequent (or unusually infrequent) in comparison with what one would expect on the basis of the comparison word-list.

Unusually infrequent key-words are called "negative key-words" and appear at the very end of your listing, in a different colour. Note that negative key-words will be omitted automatically from a keywords database and a plot.

type and token coverage

type coverage is the proportion of word types that a key word list has of the word types in the study word list.

token coverage is the proportion of word tokens that a key word list has of the total word tokens in the study word list.

These statistics are shown in the Notes computed after a key word procedure has run .

These notes show basic statistics on a small text where type coverage was 8.06% and token coverage 13.42%.

text dispersion keyness

Egbert and Biber (2019) propose that text dispersion key words can be computed by comparing the number of texts each word is found in in both the study corpus and the reference corpus (instead of comparing word frequencies).