For computing collocation strength, we can use
• the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as a "neighbour". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)
• the frequency of word 1 altogether in the corpus
• the frequency of word 2 altogether in the corpus
• the span or horizons we consider for being neighbours
• the total number of running words in our corpus: total tokens
Mutual Information
Log to base 2 of (A divided by (B times C))
where
A = joint frequency divided by total tokens
B = frequency of word 1 divided by total tokens
C = frequency of word 2 divided by total tokens
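A minimal Python sketch of this formula (the function name and the example figures are mine, not WordSmith's):

```python
import math

def mutual_information(joint, f1, f2, total):
    """MI = log2(A / (B * C)), with A = J/T, B = F1/T, C = F2/T."""
    a = joint / total   # joint frequency as a proportion of the corpus
    b = f1 / total      # word 1 frequency as a proportion
    c = f2 / total      # word 2 frequency as a proportion
    return math.log2(a / (b * c))
```

The T terms partly cancel, so this is equivalent to log2((J * T) / (F1 * F2)).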
MI3
Log to base 2 of ((J cubed times E) divided by B)
where
J = joint frequency
F1 = frequency of word 1
F2 = frequency of word 2
E = J + (total tokens - F1) + (total tokens - F2) + (total tokens - F1 - F2)
B = (J + (total tokens - F1)) times (J + (total tokens - F2))
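The same definition written out in Python (a sketch; the function name and test figures are mine):

```python
import math

def mi3(joint, f1, f2, total):
    """MI3 = log2((J cubed * E) / B) as defined above."""
    e = joint + (total - f1) + (total - f2) + (total - f1 - f2)
    b = (joint + (total - f1)) * (joint + (total - f2))
    return math.log2((joint ** 3) * e / b)
```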
T Score
(J - ((F1 times F2) divided by total tokens)) divided by (square root of (J))
where
J = joint frequency
F1 = frequency of word 1
F2 = frequency of word 2
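In Python this might look like the following (a sketch; names and example figures are mine):

```python
import math

def t_score(joint, f1, f2, total):
    """T = (J - (F1 * F2) / T) / sqrt(J)."""
    expected = (f1 * f2) / total   # co-occurrences expected by chance
    return (joint - expected) / math.sqrt(joint)
```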
Z Score
(J - E) divided by the square root of (E times (1-P))
where
J = joint frequency
S = collocational span
F1 = frequency of word 1
F2 = frequency of word 2
P = F2 divided by (total tokens - F1)
E = P times F1 times S
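A Python sketch of the same computation (function name and example figures are mine):

```python
import math

def z_score(joint, f1, f2, total, span):
    """Z = (J - E) / sqrt(E * (1 - P)), with P = F2/(T - F1) and E = P * F1 * S."""
    p = f2 / (total - f1)   # probability of word 2 outside word 1's positions
    e = p * f1 * span       # expected co-occurrences within the span
    return (joint - e) / math.sqrt(e * (1 - p))
```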
Dice Coefficient
(J times 2) divided by (F1 + F2)
where
J = joint frequency
F1 = frequency of word 1 or corpus 1 word count
F2 = frequency of word 2 or corpus 2 word count
Ranges between 0 and 1.
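In Python (a one-line sketch; the function name is mine):

```python
def dice(joint, f1, f2):
    """Dice coefficient = 2J / (F1 + F2); ranges from 0 to 1."""
    return (2 * joint) / (f1 + f2)
```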
Log Likelihood (different corpora)
where
a = frequency of term 1
b = frequency of term 2
c = total words in corpus 1
d = total words in corpus 2
computes
E1 = c*(a+b) / (c+d) and E2 = d*(a+b) / (c+d)
Log Likelihood is
2*((a* Log (a/E1)) + (b* Log (b/E2)))
(using Log to the base e)
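A Python sketch of this calculation (function name and example figures are mine; note it assumes a and b are both non-zero, since ln(0) is undefined):

```python
import math

def log_likelihood(a, b, c, d):
    """LL = 2 * (a * ln(a/E1) + b * ln(b/E2)),
    with E1 = c*(a+b)/(c+d) and E2 = d*(a+b)/(c+d)."""
    e1 = c * (a + b) / (c + d)   # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)   # expected frequency in corpus 2
    return 2 * (a * math.log(a / e1) + b * math.log(b / e2))
```

When the term has the same relative frequency in both corpora the score is zero.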
BIC Score
is the log likelihood above minus Log(c+d).
Log Likelihood (same corpus)
uses
J = joint frequency
F1 = frequency of word 1 or corpus 1 word count
F2 = frequency of word 2 or corpus 2 word count
T = total word count
then computes K11 = J; K12 = (F1 * collocation span) - J; K21 = F2 - J; K22 = T - F1 - F2 - J
as input to a routine explained at Ted Dunning's blog. The use of the collocation span is proposed by Stefan Evert.
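The routine itself is not reproduced here, but Dunning's log-likelihood statistic over a 2×2 contingency table can be sketched as below. This is the standard entropy-based restatement of Dunning's formula, not WordSmith's actual code, and the helper names are mine:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Dunning-style log-likelihood ratio over a 2x2 contingency table."""
    def h(*ks):
        # sum of k * ln(k), skipping empty cells (lim k->0 of k ln k is 0)
        return sum(k * math.log(k) for k in ks if k > 0)
    total = k11 + k12 + k21 + k22
    # G2 = 2 * (sum over cells + N ln N - row sums - column sums)
    return 2 * (h(k11, k12, k21, k22) + h(total)
                - h(k11 + k12, k21 + k22) - h(k11 + k21, k12 + k22))

def collocation_llr(joint, f1, f2, total, span=1):
    """Builds K11..K22 as described above, then scores the table."""
    k11 = joint
    k12 = f1 * span - joint
    k21 = f2 - joint
    k22 = total - f1 - f2 - joint
    return llr_2x2(k11, k12, k21, k22)
```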
Log Ratio
where
a = frequency of term 1
b = frequency of term 2
c = total words in corpus 1
d = total words in corpus 2
computes
Log ((a/c) / (b/d))
(using Log to the base 2)
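In Python (a sketch; the function name is mine; it assumes b is non-zero):

```python
import math

def log_ratio(a, b, c, d):
    """Log Ratio = log2((a/c) / (b/d)): each doubling of relative
    frequency between the corpora adds 1 to the score."""
    return math.log2((a / c) / (b / d))
```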
Conditional Probability (Durrant 2008: 84)
divides the joint frequency (how often terms 1 and 2 occur together) by the frequency of term 1 (Conditional Probability A) or of term 2 (Conditional Probability B), and multiplies by 100 for better legibility.
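As Python (a sketch; the function names are mine):

```python
def conditional_probability_a(joint, f1):
    """Percentage of term 1's occurrences that have term 2 with them: 100 * J / F1."""
    return 100 * joint / f1

def conditional_probability_b(joint, f2):
    """Percentage of term 2's occurrences that have term 1 with them: 100 * J / F2."""
    return 100 * joint / f2
```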
Delta Probability (Gries, 2013: 144)
where
j = joint frequency of term 1 with term 2
a = frequency of term 1
b = frequency of term 2
c = total words in corpus
computes (j / b) - ((a-j) / (c-a-b+j)) and multiplies the result by 100 for legibility. Very similar to Conditional Probability.
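The same formula as a Python sketch (function name and example figures are mine):

```python
def delta_probability(joint, a, b, c):
    """Delta P = (j / b) - ((a - j) / (c - a - b + j)), scaled by 100."""
    return 100 * ((joint / b) - ((a - joint) / (c - a - b + joint)))
```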
Dispersion (Oakes p. 190)
where
n = number of divisions
m = mean of the frequencies over n divisions
sd = standard deviation of the frequencies
v = sd / m
r = square root of n
computes dispersion as 1 - (v / r)
(Oakes suggests the square root of n-1, but the square root of n gives slightly better results. Either way, he says this measure is designed to range between 0 and 1, but in practice a very low dispersion, such as where all the hits are in one division, can compute to less than zero. WordSmith will show results of zero or below as blanks.)
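A Python sketch of this calculation. The definitions above do not say whether sd is the population or the sample standard deviation; the sample kind is assumed here, which with r = square root of n makes the all-hits-in-one-division case come out at exactly zero:

```python
import math
import statistics

def dispersion(freqs):
    """Oakes-style dispersion: 1 - (v / r), where v = sd / mean
    over the n divisions and r = sqrt(n)."""
    n = len(freqs)
    m = statistics.mean(freqs)
    sd = statistics.stdev(freqs)  # sample sd (n-1 divisor): an assumption
    v = sd / m
    return 1 - (v / math.sqrt(n))
```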
Relative Entropy (Gries, 2010)
where
n = number of measurements
p = the probability of each measurement
computes entropy as minus the sum of (each p * log2 of p), which makes it positive, and relative entropy as entropy divided by log2 of n
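In Python this could be sketched as follows (the function name is mine; frequencies are converted to probabilities by dividing by their sum):

```python
import math

def relative_entropy(freqs):
    """Entropy of the distribution, normalised by log2(n) so that
    a perfectly even spread scores 1 and total concentration scores 0."""
    total = sum(freqs)
    probs = [f / total for f in freqs]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return entropy / math.log2(len(freqs))
```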
See also: link 1 from Lancaster University, link 2 from Lancaster, Mutual Information, plot dispersion