Formulae

For computing collocation strength, we can use

 

the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)

the frequency of word 1 altogether in the corpus

the frequency of word 2 altogether in the corpus

the span or horizons we consider for being neighbours

the total number of running words in our corpus: total tokens
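
The sketch below (plain Python, not WordSmith's own code) shows one way these counts could be gathered from a list of tokens, assuming a symmetric span either side of word 1; the function name, the variable names and the default span of 4 are illustrative only.

from collections import Counter

def collocation_counts(tokens, word1, word2, span=4):
    # Frequencies of each word and the total number of running words.
    freqs = Counter(tokens)
    f1, f2, n = freqs[word1], freqs[word2], len(tokens)
    # Joint frequency: occurrences of word2 within +/- span tokens of word1.
    joint = 0
    for i, tok in enumerate(tokens):
        if tok == word1:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            joint += window.count(word2)
    return joint, f1, f2, n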

 

Mutual Information

 

Log to base 2 of (A divided by (B times C))

where

A = joint frequency divided by total tokens

B = frequency of word 1 divided by total tokens

C = frequency of word 2 divided by total tokens
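
As a sketch in Python (the variable names are mine, not WordSmith's), the calculation is:

import math

def mutual_information(joint, f1, f2, total_tokens):
    a = joint / total_tokens   # A = joint frequency / total tokens
    b = f1 / total_tokens      # B = frequency of word 1 / total tokens
    c = f2 / total_tokens      # C = frequency of word 2 / total tokens
    return math.log2(a / (b * c))

# e.g. mutual_information(10, 100, 200, 1_000_000) is about 8.97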

 

MI3

 

Log to base 2 of ((J cubed) times E divided by B)

where

J = joint frequency

F1 = frequency of word 1

F2 = frequency of word 2

E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)

B = (J + (total tokens-F1)) times (J + (total tokens-F2))
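
A corresponding sketch, using the same illustrative variable names:

import math

def mi3(joint, f1, f2, total_tokens):
    n = total_tokens
    e = joint + (n - f1) + (n - f2) + (n - f1 - f2)
    b = (joint + (n - f1)) * (joint + (n - f2))
    return math.log2((joint ** 3) * e / b)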

 

T Score

 

(J - ((F1 times F2) divided by total tokens)) divided by (square root of (J))

where

J = joint frequency

F1 = frequency of word 1

F2 = frequency of word 2
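
Sketched in the same way:

import math

def t_score(joint, f1, f2, total_tokens):
    expected = (f1 * f2) / total_tokens   # expected joint frequency
    return (joint - expected) / math.sqrt(joint)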

 

 

Z Score

 

(J - E) divided by the square root of (E times (1-P))

where

J = joint frequency

S = collocational span

F1 = frequency of word 1

F2 = frequency of word 2

P = F2 divided by (total tokens - F1)

E = P times F1 times S
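
A sketch, with the collocational span passed in as S:

import math

def z_score(joint, f1, f2, total_tokens, span):
    p = f2 / (total_tokens - f1)   # P = F2 / (total tokens - F1)
    e = p * f1 * span              # E = P * F1 * S
    return (joint - e) / math.sqrt(e * (1 - p))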

 

Dice Coefficient

 

(J times 2) divided by (F1 + F2)

where

J = joint frequency

F1 = frequency of word 1 or corpus 1 word count

F2 = frequency of word 2 or corpus 2 word count

Ranges between 0 and 1.
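
A one-line sketch:

def dice_coefficient(joint, f1, f2):
    # Ranges from 0 (never together) to 1 (always together).
    return (joint * 2) / (f1 + f2)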

 

Log Likelihood (different corpora)

 

where

 a = frequency of term 1

 b = frequency of term 2

 c = total words in corpus 1

 d = total words in corpus 2

computes

 E1 = c*(a+b) / (c+d) and E2 = d*(a+b) / (c+d)

 Log Likelihood is

 2*((a* Log (a/E1)) + (b* Log (b/E2)))

(using Log to the base e)
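
A sketch of the calculation; the guard for zero frequencies is my addition, relying on the usual convention that x times ln(x) tends to 0.

import math

def log_likelihood(a, b, c, d):
    # Expected frequencies if the term were proportionally as common
    # in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    def term(freq, expected):
        return freq * math.log(freq / expected) if freq > 0 else 0.0
    return 2 * (term(a, e1) + term(b, e2))

# e.g. log_likelihood(40, 10, 100_000, 120_000) is about 25.2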

 

Log Likelihood (same corpus)

uses

J = joint frequency

F1 = frequency of word 1 or corpus 1 word count

F2 = frequency of word 2 or corpus 2 word count

T = total word count

 

then computes K11 = J; K12 = F1 * collocation span - J; K21 = F2 - J; K22 = T - F1 - F2 - J

as input to a routine explained at Ted Dunning's blog. The use of the collocation span here was proposed by Stefan Evert.
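
That routine is not reproduced here; as an illustration, the following sketch computes the standard log-likelihood ratio (G2) of the 2x2 table built from those cells, which is the calculation Dunning describes, though WordSmith's exact implementation may differ.

import math

def g2(k11, k12, k21, k22):
    # Log-likelihood ratio of a 2x2 contingency table:
    # 2 * sum of observed * ln(observed / expected).
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    ll = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                ll += o * math.log(o * total / (rows[i] * cols[j]))
    return 2 * ll

def log_likelihood_collocation(joint, f1, f2, total_tokens, span):
    # Contingency cells as described above.
    k11 = joint
    k12 = f1 * span - joint
    k21 = f2 - joint
    k22 = total_tokens - f1 - f2 - joint
    return g2(k11, k12, k21, k22)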

 

Log Ratio

where

 a = frequency of term 1

 b = frequency of term 2

 c = total words in corpus 1

 d = total words in corpus 2

computes

 Log ((a/c) / (b/d))

(using Log to the base 2)
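
A sketch; note that if b is zero the ratio is undefined, so practical implementations usually adjust zero frequencies, whereas this sketch assumes both frequencies are above zero.

import math

def log_ratio(a, b, c, d):
    # log2 of the ratio between the term's relative frequencies
    # in the two corpora.
    return math.log2((a / c) / (b / d))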

 

Dispersion (Oakes p. 190)

where

n = number of divisions

m = mean of the frequencies over n divisions

sd = standard deviation of the frequencies

v = sd / m

r = square root of n

computes dispersion as 1 - (v / r)

(Oakes suggests the square root of n-1, but the square root of n gives slightly better results. Either way, he says the measure is designed to range between 1 and 0, but in practice a very low dispersion, such as when all the hits fall in one division, can compute to less than zero. WordSmith shows results of zero or below as blanks.)
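
A sketch, assuming the sample standard deviation (divisor n-1), since the text does not say which variant WordSmith uses:

import math

def dispersion(division_freqs):
    n = len(division_freqs)
    m = sum(division_freqs) / n   # mean frequency per division
    sd = math.sqrt(sum((f - m) ** 2 for f in division_freqs) / (n - 1))
    v = sd / m                    # coefficient of variation
    r = math.sqrt(n)              # square root of n, as preferred above
    return 1 - (v / r)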

See also: link 1 from Lancaster University, link 2 from Lancaster, Mutual Information, plot dispersion
