WordList clusters ![WSImage_361_clusters_24](./images/wsimage_361_clusters_24.png)
A word list need not consist of single words. You can ask for a word list of clusters with two, three, or up to eight words on each line.
Example
Here is a small set of 3-4 word clusters involving the word money.
![money_wordlist_clusters](./images/money_wordlist_clusters.png)
Some of them are plausible multi-word units. The number of words in each cluster gets inserted in the Set column.
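As a rough illustration of what a cluster list contains, here is a minimal Python sketch that counts every run of 3 or 4 consecutive words in a short text and records the cluster size alongside each frequency, much like the Set column. The function name `count_clusters` and the sample sentence are invented for this example; this is not how WordList is implemented internally.

```python
from collections import Counter

def count_clusters(tokens, min_size=3, max_size=4):
    """Count every run of min_size..max_size consecutive tokens.

    Returns a list of (cluster, frequency, set) rows, where "set" is the
    number of words in the cluster.
    """
    counts = Counter()
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            counts[" ".join(tokens[i:i + size])] += 1
    rows = [(c, f, len(c.split())) for c, f in counts.items()]
    # Most frequent first, then alphabetical for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

# Hypothetical sample text, tokenised naively on whitespace.
tokens = "a lot of money and a lot of money to spend".lower().split()
rows = count_clusters(tokens)

# Keep only clusters that involve the word "money".
money_rows = [row for row in rows if "money" in row[0].split()]
for cluster, freq, size in money_rows:
    print(f"{cluster:<22} {freq:>3} {size:>3}")
```

A real word list would be built from a whole corpus through the index, so the frequencies here only show the shape of the output.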
How to do it
To do cluster processing in WordList, first make an index.
Then open the index. Now choose Compute | Clusters.
Words to make clusters from
•"all": all the clusters involving all words above a certain frequency (this will be slow for a big corpus like the BNC), or
•"selected": clusters only for words you've selected (e.g. you have highlighted BOOK and BOOKS and you want clusters like book a table, in my book).
•ranging from one character to another
•loaded up from a plain text file
![cluster_choices_for_index-words](./images/cluster_choices_for_index-words.png)
Exclusions
Clusters
![cluster_choices_for_index-clusters](./images/cluster_choices_for_index-clusters.png)
The cluster size must be between 2 and 8 words.
The min. freq. is the minimum number of occurrences a cluster must have for you to see it.
no words containing #: if selected, this won't show any clusters involving numbers and dates
phrase frames like book * hotel: you can choose to include these, exclude them or get only phrase frames.
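The effect of the min. freq. and "no words containing #" settings can be sketched as a simple filter over a cluster-to-frequency table. This sketch assumes that numbers have already been replaced by # in the cluster text; `filter_clusters` and the sample counts are hypothetical names and data for illustration only.

```python
def filter_clusters(counts, cluster_size=3, min_freq=2, no_hash_words=True):
    """Keep clusters of the requested size that meet the frequency threshold.

    counts maps a cluster string to its frequency.
    """
    if not 2 <= cluster_size <= 8:
        raise ValueError("cluster size must be between 2 and 8 words")
    kept = {}
    for cluster, freq in counts.items():
        words = cluster.split()
        if len(words) != cluster_size:
            continue  # wrong number of words
        if freq < min_freq:
            continue  # too rare to show
        if no_hash_words and any("#" in w for w in words):
            continue  # involves a number or date
        kept[cluster] = freq
    return kept

# Hypothetical counts, not taken from any real corpus.
sample = {"book a table": 3, "in # days": 5, "my old book": 1, "book a": 9}
print(filter_clusters(sample))
```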
Statistics
Advanced
The "max. frequency %" setting is to speed the process up.
It sets the highest frequency percentage a word can have if clusters beginning with that word are still to be calculated. There are a great many occurrences of the very high frequency items, and you may well not be interested in clusters which begin with them. For example, the item the is likely to be about 6% of any word-list (about 6 million of them in the BNC therefore), and you might not want clusters starting the... If so, you might set the max. percent to 0.5% or 0.1% (which for the BNC corpus will cut out the 102 most frequent words). You will still get clusters which include very high frequency items in the middle or at the end, like the a in book a table, but would not get in my book, which begins with the very high frequency word in. The more words you include, the longer the process will take.
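The idea behind the max. frequency % setting can be sketched as follows: work out each word's share of the running words, then skip any cluster whose first word exceeds the threshold, while still allowing frequent words in the middle or at the end. The function `clusters_below_max_pct`, the toy text and the 20% threshold are all assumptions chosen so that the effect is visible on a few words; real settings would be far lower, such as 0.5%.

```python
from collections import Counter

def clusters_below_max_pct(tokens, size=3, max_pct=0.5):
    """Count clusters, skipping those that START with a very frequent word."""
    word_freq = Counter(tokens)
    total = len(tokens)
    counts = Counter()
    for i in range(total - size + 1):
        first = tokens[i]
        share = 100.0 * word_freq[first] / total
        if share > max_pct:
            continue  # the starting word is too frequent
        counts[" ".join(tokens[i:i + size])] += 1
    return counts

# Toy text: "in" is 3 of 12 tokens (25%); "book" and "a" are 2 of 12 each.
tokens = "in my book in the end in a way book a table".split()
result = clusters_below_max_pct(tokens, size=3, max_pct=20)
print(sorted(result))
```

With this toy threshold, book a table survives even though it contains a, while in my book is dropped because it begins with in.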
Max. seconds per word is another way of controlling how long the process will take. The default (0) means no limit. But if you set this e.g. to 30 then as WordList processes the words in order, as soon as one has taken 30 seconds no further clusters will be collected starting with that word.
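A per-word time budget of this kind might look like the sketch below, which groups cluster start positions by their first word and stops collecting for a word once its budget is used up. The names `clusters_with_time_limit` and `clock` are hypothetical; the `clock` parameter exists only so that the behaviour can be demonstrated without waiting, and nothing here reflects WordList's actual code.

```python
import time
from collections import Counter, defaultdict

def clusters_with_time_limit(tokens, size=3, max_seconds=0, clock=time.monotonic):
    """Count clusters word by word, giving each starting word a time budget.

    max_seconds=0 means no limit, mirroring the default described above.
    """
    starts_by_word = defaultdict(list)
    for i in range(len(tokens) - size + 1):
        starts_by_word[tokens[i]].append(i)

    counts = Counter()
    for word, starts in starts_by_word.items():
        began = clock()
        for i in starts:
            if max_seconds and clock() - began >= max_seconds:
                break  # budget spent: collect nothing more for this word
            counts[" ".join(tokens[i:i + size])] += 1
    return counts

tokens = "in my book in my book in my book".split()
unlimited = clusters_with_time_limit(tokens)  # default: no limit

# A fake clock that advances one "second" per call, to simulate slow work.
ticks = iter(range(1000))
limited = clusters_with_time_limit(tokens, max_seconds=2, clock=lambda: next(ticks))
print(unlimited["in my book"], limited["in my book"])
```

Clusters already collected for a word are kept; only the remaining occurrences starting with that word are skipped, so frequencies for slow words may be undercounted.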
![cluster_choices_for_index-advanced](./images/cluster_choices_for_index-advanced.png)
batch processing allows you to create a whole set of cluster word-lists at one time.
Phrase frames
These are what William H. Fletcher has defined as phrase-frames, i.e. "groups of wordgrams identical but for a single word", in his kfNgram program.
Here are phrase frames from Dickens' novel Bleak House. The wildcard word is represented with *.
![phrase_frames_BH](./images/phrase_frames_bh.png)
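To make the definition concrete, here is a minimal sketch that groups clusters which are identical but for a single word. Each word of a cluster is replaced in turn with *, and a frame is kept only when at least two different clusters fit it. The function `phrase_frames` and the sample frequencies are invented for illustration, and allowing the wildcard in any position is an assumption of this sketch, not a description of kfNgram or WordList.

```python
from collections import defaultdict

def phrase_frames(cluster_counts):
    """Group clusters that differ in exactly one word position.

    Returns {frame: {variant_cluster: frequency}} for frames that have
    at least two distinct variants.
    """
    frames = defaultdict(dict)
    for cluster, freq in cluster_counts.items():
        words = cluster.split()
        for slot in range(len(words)):
            frame = " ".join(words[:slot] + ["*"] + words[slot + 1:])
            frames[frame][cluster] = freq
    return {f: v for f, v in frames.items() if len(v) >= 2}

# Hypothetical frequencies, not the real Bleak House figures.
counts = {"on the other hand": 5, "on the one hand": 3, "in his own hand": 1}
frames = phrase_frames(counts)
for frame, variants in frames.items():
    print(frame, sum(variants.values()), variants)
```

The variants gathered under each frame correspond loosely to the individual clusters that make up one phrase frame, and their summed frequency gives a total for the frame.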
If you join clusters, you can get this:
![clusters involving hand joined](./images/hmfile_hash_082b1b3d.png)
If you double-click the lemmas column, you get to see the detail.
![lemma_clusters](./images/lemma_clusters.png)
The process joins all the variants of the phrase in the Lemmas column. In the word list itself they will appear deleted (because they have been joined to another item, the phrase frame). You can un-join them all if you want (Edit | Joining | Unjoin all).
Omit phrase frames?
If you don't want to see phrase frames, select the omit phrase frames option.
You can delete them with Edit | Deleting | delete phrase frames.
It's a word list
Finally, remember this listing is just like a single-word word list. You can save it as a .lst file and open it again at any time, separately from the index.
See also: clusters in single-word list, find the files for specific clusters, clusters in Concord