Single words v. Clusters in WordList


 

WordList clusters

A word list doesn't need to consist of single words. You can ask for a word list with clusters of two, three, or up to eight words on each line. To do cluster processing in WordList, first make an index.
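As a rough illustration of what a cluster word list contains, here is a minimal Python sketch that counts every run of consecutive words in a short text. It is not WordSmith's own code; the function name `cluster_counts` and the sample sentence are invented for the example.

```python
from collections import Counter

def cluster_counts(tokens, size):
    """Count every run of `size` consecutive tokens (one 'cluster' per run)."""
    return Counter(
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    )

# Invented sample text, lower-cased and split on whitespace.
tokens = "in my book you can book a table or book a room".lower().split()
counts = cluster_counts(tokens, 2)

# Each entry is one line of a 2-word cluster list with its frequency.
for cluster, freq in counts.most_common(3):
    print(cluster, freq)
```

A real word list would also need proper tokenisation (punctuation, hyphens, apostrophes), which this sketch ignores.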

 

How to see clusters…

Open the index. Now choose Compute | Clusters.

 

cluster_choices_for_index

 

Words to make clusters from

"all": all the clusters involving all words above a certain frequency (this will be s-l-o-w for a big corpus like the BNC World), or
"selection": clusters only for words you've selected (e.g. you have highlighted BOOK and BOOKS and you want clusters like book a table, in my book).
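The difference between the two choices can be sketched as follows. This is only an assumption about the general idea, not WordSmith's implementation: `clusters_for_selection` is a hypothetical helper that keeps a cluster only if it contains at least one selected word.

```python
from collections import Counter

def clusters_for_selection(tokens, selected, size=3):
    """Count clusters of `size` words that contain any of the selected words."""
    selected = {w.lower() for w in selected}
    counts = Counter()
    for i in range(len(tokens) - size + 1):
        window = tokens[i:i + size]
        # Keep the cluster only if a selected word appears somewhere in it.
        if selected.intersection(window):
            counts[" ".join(window)] += 1
    return counts

# Invented sample text.
tokens = "please book a table for two it says so in my book".split()
result = clusters_for_selection(tokens, ["BOOK", "BOOKS"], size=3)
print(sorted(result))
```

With "all", the same loop would run without the selection test, which is why it takes so much longer on a large corpus.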

 

To choose words which aren't next to each other, hold down Control and click the number at the left, then keep Control held down and click elsewhere. The first one clicked will go green and the others white. In the picture below, using an index of the BNC World corpus, I selected world and then life by clicking numbers 164 and 167.

 

choosing_by_marking

 

The process will take time. In the case of BNC World, the index knows the positions of all of the 100 million words. To find 3-word clusters, in the case above, it took about a minute to process all the 115,000 cases of world and life and find 5,719 clusters like the world bank and of real life. Chris Tribble tells me it took his PC 36 hours to compute all 3-word clusters on the whole BNC. He was able to use the PC in the meantime, but that's not a job you're going to want to do often.
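To show why an index that knows word positions helps, here is a small sketch under stated assumptions: the index is modelled as a plain dictionary from each word to its list of positions, and `build_index` and `clusters_from_index` are hypothetical names. WordSmith's actual index format is not described here.

```python
from collections import Counter, defaultdict

def build_index(tokens):
    """Map each word to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(tokens):
        index[word].append(pos)
    return index

def clusters_from_index(tokens, index, word, size=3):
    """Collect every cluster of `size` words that overlaps an occurrence of `word`."""
    counts = Counter()
    seen_starts = set()  # avoid counting the same stretch of text twice
    for pos in index.get(word, []):
        for start in range(pos - size + 1, pos + 1):
            if start < 0 or start + size > len(tokens) or start in seen_starts:
                continue
            seen_starts.add(start)
            counts[" ".join(tokens[start:start + size])] += 1
    return counts

# Invented sample text.
tokens = "the world bank said the world bank of real life".split()
index = build_index(tokens)
world = clusters_from_index(tokens, index, "world")
life = clusters_from_index(tokens, index, "life")
print(world.most_common(1), list(life))
```

Only the neighbourhoods of the selected words are visited, so the work grows with the number of hits (about 115,000 in the example above) and not with the size of the whole corpus.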

 

What you see

The "cluster size" must be between 2 and 8 words.

The "min. frequency" is the minimum number of times a cluster must occur for you to see it.

Here the user has chosen to see any 3-word clusters that appear 5 or more times.
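The two settings can be pictured with a short sketch. It assumes the simplest reading of them (a size check, then a frequency threshold); `frequent_clusters` and the sample data are invented for illustration.

```python
from collections import Counter

def frequent_clusters(tokens, size=3, min_freq=5):
    """Return clusters of `size` words that occur at least `min_freq` times."""
    if not 2 <= size <= 8:
        raise ValueError("cluster size must be between 2 and 8 words")
    counts = Counter(
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    )
    # Drop anything below the minimum frequency.
    return {cluster: n for cluster, n in counts.items() if n >= min_freq}

# Invented sample: "the world bank" repeated five times, then "of real life" once.
tokens = ("the world bank " * 5 + "of real life").split()
shown = frequent_clusters(tokens, size=3, min_freq=5)
print(shown)
```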

 

Working constraints

The "max. frequency %" setting is there to speed the process up. It sets the maximum frequency, as a percentage of the corpus, that a word may have if clusters beginning with it are to be computed. Very high frequency items occur in huge numbers and you may well not be interested in clusters which begin with them. For example, the item the is likely to be about 6% of any word-list (so about 6 million of them in the BNC), and you might not want clusters starting the... If so, you might set the max. percent to 0.5% or 0.1% (which for the BNC World corpus will cut out the 102 most frequent words). You will still get clusters which include very high frequency items in the middle or at the end, like the a in book a table, but you would not get in my book, which begins with the very high frequency word in. The more words you include, the longer the process will take.
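A minimal sketch of this kind of cap, assuming it is applied only to the first word of each cluster and that a word exactly at the limit is still allowed (both are assumptions, as are the names `allowed_starters` and `clusters_with_cap`):

```python
from collections import Counter

def allowed_starters(tokens, max_freq_pct):
    """Words whose share of all tokens does not exceed `max_freq_pct` percent."""
    total = len(tokens)
    freq = Counter(tokens)
    return {w for w, n in freq.items() if 100.0 * n / total <= max_freq_pct}

def clusters_with_cap(tokens, size, max_freq_pct):
    """Count clusters, skipping those that begin with a too-frequent word."""
    starters = allowed_starters(tokens, max_freq_pct)
    counts = Counter()
    for i in range(len(tokens) - size + 1):
        if tokens[i] in starters:  # only the first word is checked
            counts[" ".join(tokens[i:i + size])] += 1
    return counts

# Invented sample in which "in" makes up 25% of the tokens.
tokens = "in my book in the end in a way book a table".split()
capped = clusters_with_cap(tokens, size=3, max_freq_pct=20.0)
print("book a table" in capped, "in my book" in capped)
```

In this toy text the cap is 20% because the sample is tiny; on a real corpus the values suggested above (0.5% or 0.1%) play the same role.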

Max. seconds per word is another way of controlling how long the process will take. The default (0) means no limit. If you set it to 30, for example, then as WordList processes the words in order, as soon as one word has taken 30 seconds no further clusters starting with that word will be collected.
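The per-word time limit might work along these lines. This is a sketch under assumptions, not WordSmith's code: `positions_by_word` is a hypothetical mapping from each word to its positions, and the `clock` parameter exists only so the behaviour can be tried out without waiting.

```python
import time
from collections import Counter

def clusters_with_time_limit(tokens, positions_by_word, size=3,
                             max_seconds=0, clock=time.monotonic):
    """Collect clusters starting at each word, giving up on a word once
    `max_seconds` have been spent on it. 0 means no limit."""
    counts = Counter()
    for word, positions in positions_by_word.items():
        started = clock()
        for pos in positions:
            if max_seconds and clock() - started >= max_seconds:
                break  # time is up for this word; move on to the next word
            if pos + size <= len(tokens):
                counts[" ".join(tokens[pos:pos + size])] += 1
    return counts

# Invented sample text and positions.
tokens = "the world bank and the world cup".split()
positions = {"world": [1, 5]}
unlimited = clusters_with_time_limit(tokens, positions, size=2)
print(dict(unlimited))
```

Clusters already collected for a word are kept when its time runs out; only the remaining occurrences of that word are skipped.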

 

Stop at, as with Concord clusters, offers a number of constraints, such as sentence breaks and other punctuation-marked breaks. The idea is that a 5-word cluster which starts in one sentence and continues in the next is not likely to make much sense.
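The effect of stopping at sentence breaks can be sketched like this. The sentence splitting here is deliberately crude (any run of ".", "!" or "?") and is an assumption for the example; `clusters_within_sentences` is a hypothetical name.

```python
import re
from collections import Counter

def clusters_within_sentences(text, size=3):
    """Count clusters of `size` words that do not cross a sentence break."""
    counts = Counter()
    # Crude sentence split on end punctuation; clusters are built per sentence.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i in range(len(words) - size + 1):
            counts[" ".join(words[i:i + size])] += 1
    return counts

# Invented sample text.
text = "I read the book. A table was free."
within = clusters_within_sentences(text, size=3)
print(sorted(within))
```

Without the split, "book a table" would be counted here even though book ends one sentence and a table begins the next.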

 

What they look like

 

clusters_of_rabies

 

Here is a small set of 3-word clusters involving rabies from the BNC World corpus. Some of them are plausible multi-word units. All clusters which appear at least 5 times are shown: to alter that setting, choose Adjust Settings | Index in the Controller and set the "show if frequency.." number thus:

 

cluster_settings_in_controller

 

See also: clusters in Concord