Mutual Information
the point of it
A Mutual Information (MI) score relates one word to another. For example, if "problem" is often found with "solve", the two may have a high mutual information score. Usually, "the" will be found near "problem" much more often than "solve" is, so the procedure for calculating Mutual Information takes into account not just the most frequent words found near the word in question, but also whether each word is often found elsewhere, well away from the word in question. Since "the" is found very often indeed far away from "problem", it will not tend to be related; that is, it will get a low MI score.
This relationship is bilateral: in the case of "kith" and "kin", the score doesn't distinguish between the virtual certainty of finding "kin" near "kith" and the much lower likelihood of finding "kith" near "kin".
There are various formulae for computing the strength of collocational relationships. The MI in WordSmith ("specific mutual information") is computed using a formula derived from Gaussier, Lange and Meunier, described in Oakes, p. 174; here the probability is based on total corpus size in tokens. Other measures of collocational relation are computed too; these are explained under Mutual Information Display.
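To make the idea concrete, here is a minimal sketch of the standard pointwise ("specific") mutual information calculation, with each probability estimated as a count divided by the total number of tokens. It is an illustration only: the function name specific_mi and all the counts are hypothetical, and the exact formula WordSmith applies may differ in detail from this textbook form.

```python
import math

def specific_mi(joint_freq, node_freq, collocate_freq, corpus_tokens):
    """Pointwise mutual information in bits, from raw counts (illustrative)."""
    if min(joint_freq, node_freq, collocate_freq, corpus_tokens) <= 0:
        raise ValueError("all counts must be positive")
    # P(node, collocate) / (P(node) * P(collocate)), with every probability
    # estimated as count / corpus_tokens, simplifies to this ratio.
    ratio = (joint_freq * corpus_tokens) / (node_freq * collocate_freq)
    return math.log2(ratio)

# Hypothetical counts in a 1,000,000-token corpus (invented for illustration).
N = 1_000_000
# "problem" (500 hits) with "solve" (200 hits), found together 40 times.
mi_solve = specific_mi(40, 500, 200, N)
# "problem" with "the" (60,000 hits), found together 60 times.
mi_the = specific_mi(60, 500, 60_000, N)

print(f"problem/solve: {mi_solve:.2f}")
print(f"problem/the:   {mi_the:.2f}")
```

With these invented figures, "the" occurs near "problem" more often than "solve" does, yet it scores far lower because it is so frequent everywhere else. Swapping the node and collocate frequencies gives the same result, which is the bilateral property described earlier.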
The Mutual Information settings are found in the Controller under Adjust Settings | Indexing or in a menu option in WordList.
stop at: you can choose where you want collocational breaks to be assumed. If sentence breaks are chosen, "I wrote the letter. Then I posted it" would not consider "posted" as a possible collocate of "letter" because there's a sentence break between them.
max. percent: ignores any tokens which are more frequent than the percentage indicated. (The point of this is to avoid computing mutual information for words like "the" and "of", which are likely to have a frequency greater than, say, 1.0%.)
span: the maximum distance, in words, between collocate and node. With a span of 5, the node "wrote" would consider "the", "letter", "then", "I" and "posted" as possible collocates if stop at were set at no limits.
min. mutual info: the minimum MI score a pair must reach in order to be reported. A useful limit is 3.0. Below this, the linkage between node and collocate is likely to be rather tenuous.
min. frequency: the minimum frequency for any item to be considered for the mutual information calculation (default = 5). (If an item occurs only once or twice, the mutual information is unlikely to be informative.)
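The way these settings interact can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not WordSmith's own procedure: the function mi_collocates and its parameter names are hypothetical, the tokenisation is a simple regular expression, the span is assumed to apply on both sides of the node, and the frequency filters are assumed to apply to node and collocate alike.

```python
import math
import re
from collections import Counter

def mi_collocates(text, node, span=5, stop_at_sentence=True,
                  max_percent=1.0, min_freq=5, min_mi=3.0):
    """Return {collocate: MI} for one node word (hypothetical helper)."""
    node = node.lower()
    # "stop at": treat . ! ? as breaks that collocation may not cross.
    chunks = re.split(r"[.!?]+", text) if stop_at_sentence else [text]
    segments = [re.findall(r"[a-z']+", chunk.lower()) for chunk in chunks]
    freq = Counter(word for seg in segments for word in seg)
    total = sum(freq.values())

    def usable(word):
        # "min. frequency" and "max. percent" filters on overall frequency.
        return freq[word] >= min_freq and 100 * freq[word] / total <= max_percent

    if total == 0 or freq[node] == 0 or not usable(node):
        return {}

    # "span": collect words up to `span` positions from each node occurrence
    # (both sides of the node, which is an assumption of this sketch).
    joint = Counter()
    for seg in segments:
        for i, word in enumerate(seg):
            if word == node:
                joint.update(seg[max(0, i - span):i] + seg[i + 1:i + 1 + span])

    scores = {}
    for word, together in joint.items():
        if word == node or not usable(word):
            continue
        mi = math.log2(together * total / (freq[node] * freq[word]))
        if mi >= min_mi:  # "min. mutual info"
            scores[word] = mi
    return scores

# The example sentence pair, with the filters loosened for so tiny a text.
text = "I wrote the letter. Then I posted it."
loose = dict(max_percent=100, min_freq=1, min_mi=0)
with_breaks = mi_collocates(text, "letter", stop_at_sentence=True, **loose)
no_limits = mi_collocates(text, "letter", stop_at_sentence=False, **loose)
print("posted" in with_breaks, "posted" in no_limits)
```

On real data the defaults shown in the signature (values taken from the descriptions above) would be more sensible than the loosened ones; a text of eight tokens is far too small for the scores to mean anything.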
See Oakes for further information about Mutual Information.