Reference > formulae

For computing collocation strength, we can use


the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)
the frequency of word 1 altogether in the corpus
the frequency of word 2 altogether in the corpus
the span or horizons we consider for being neighbours
the total number of running words in our corpus: total tokens



Mutual Information


Log to base 2 of (A divided by (B times C))


A = joint frequency divided by total tokens

B = frequency of word 1  divided by total tokens

C = frequency of word 2  divided by total tokens
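As a rough illustration, the Mutual Information formula above can be sketched in Python. The function name and parameter names are invented for this example and are not taken from any program; the sketch simply follows A, B and C as defined above, and assumes the joint frequency has already been counted within whatever span you have chosen.

```python
import math

def mutual_information(joint_freq, freq1, freq2, total_tokens):
    """Log to base 2 of (A / (B * C)), with A, B, C as defined above."""
    if joint_freq <= 0 or freq1 <= 0 or freq2 <= 0:
        # The logarithm is undefined when any of the counts is zero.
        raise ValueError("all frequencies must be greater than zero")
    a = joint_freq / total_tokens   # A = joint frequency / total tokens
    b = freq1 / total_tokens        # B = frequency of word 1 / total tokens
    c = freq2 / total_tokens        # C = frequency of word 2 / total tokens
    return math.log2(a / (b * c))

# Made-up counts: two words occurring 100 times each in 10,000 tokens,
# 8 of those times together. The result is approximately 3.
print(mutual_information(8, 100, 100, 10000))
```

With these made-up counts the pair co-occurs about eight times more often than chance would predict, and the base 2 logarithm of 8 is 3.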




MI3


Log to base 2 of ((J cubed) times E divided by B)


J = joint frequency

F1 = frequency of word 1

F2 = frequency of word 2

E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)

B = (J + (total tokens-F1)) times (J + (total tokens-F2))
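The formula with J cubed can be sketched the same way. This is only a transcription of J, E and B exactly as they are defined above; the function name and parameter names are made up for the example.

```python
import math

def cubed_joint_score(joint_freq, freq1, freq2, total_tokens):
    """Log to base 2 of ((J cubed) * E / B), with E and B as defined above."""
    if joint_freq <= 0:
        # log2 of zero is undefined, so a pair that never co-occurs has no score.
        raise ValueError("joint frequency must be greater than zero")
    j = joint_freq
    not_f1 = total_tokens - freq1   # total tokens - F1
    not_f2 = total_tokens - freq2   # total tokens - F2
    e = j + not_f1 + not_f2 + (total_tokens - freq1 - freq2)
    b = (j + not_f1) * (j + not_f2)
    return math.log2(j ** 3 * e / b)

# Made-up counts, for illustration only.
print(cubed_joint_score(8, 100, 100, 10000))
```

Because the joint frequency is cubed, doubling J while holding the other counts fixed raises the score by roughly 3.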


Z Score


(J - E) divided by the square root of (E times (1-P))


J = joint frequency

S = collocational span

F1 = frequency of word 1

F2 = frequency of word 2

P = F2 divided by (total tokens - F1)

E = P times F1 times S
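A minimal Python sketch of the Z Score calculation follows. The names are invented for the example, and it assumes that S is simply a number (how many word positions the span covers), which the definitions above do not spell out.

```python
import math

def z_score(joint_freq, freq1, freq2, total_tokens, span):
    """(J - E) / sqrt(E * (1 - P)), with P and E as defined above."""
    p = freq2 / (total_tokens - freq1)   # P = F2 / (total tokens - F1)
    e = p * freq1 * span                 # E = P * F1 * S, the expected joint frequency
    return (joint_freq - e) / math.sqrt(e * (1 - p))

# Made-up counts and an assumed span of 5 positions; the result is roughly 1.32.
print(z_score(8, 100, 100, 10000, 5))
```

A positive score means the pair was observed more often than E predicts, a negative score less often.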



Log Likelihood

Based on Oakes, pp. 170-2.

2 times (

       a Ln a + b Ln b + c Ln c + d Ln d

       - (a+b) Ln (a+b)

       - (a+c) Ln (a+c)

       - (b+d) Ln (b+d)

       - (c+d) Ln (c+d)

       + (a+b+c+d) Ln (a+b+c+d)

)



a = joint frequency

b = frequency of word 1

c = frequency of word 2

d = frequency of pairs involving neither word 1 nor word 2

and "Ln" means Natural Logarithm
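The Log Likelihood sum above can be sketched in Python as follows. The function names are invented for the example. The sketch takes a, b, c and d as already-computed numbers and does not decide how they are derived from the corpus. It also assumes the usual convention that a term "x Ln x" counts as zero when x is zero, which the formula above does not state.

```python
import math

def x_ln_x(x):
    """x times the natural log of x, treated as 0 when x is 0 (an assumed convention)."""
    return x * math.log(x) if x > 0 else 0.0

def log_likelihood(a, b, c, d):
    """Twice the sum of the nine terms listed above."""
    return 2 * (
        x_ln_x(a) + x_ln_x(b) + x_ln_x(c) + x_ln_x(d)
        - x_ln_x(a + b)
        - x_ln_x(a + c)
        - x_ln_x(b + d)
        - x_ln_x(c + d)
        + x_ln_x(a + b + c + d)
    )

# Four equal values show no association at all, so the score is about 0.
print(log_likelihood(10, 10, 10, 10))
```

Keeping the "x Ln x" term in a small helper means a zero value does not cause a math error.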


See also: this link from Lancaster University, Mutual Information