single words v. clusters

The point of it…

Clusters are words which are found repeatedly together in each other's company, in sequence. They represent a tighter relationship than collocates, more like multi-word units or groups or phrases. (I call them clusters because groups and phrases already have uses in grammar and because simply being found together by software doesn't guarantee they are true multi-word units.) Biber calls them "lexical bundles".

Language is phrasal and textual. It is not helpful to see it as a matter of selecting a word to fill a grammatical "slot" as implied by structural theories. Words keep company: the extreme example is idiom where they're bound tightly to each other, but all words have a tendency to cluster together with some others. These clustering relations may involve colligation (e.g. the relationship between depend and on), collocation, and semantic prosody (the tendency for cause to come with negative effects such as accident, trouble, etc.).

WordSmith Tools gives you two opportunities for identifying word clusters, in WordList and Concord. They use different methods. Concord only processes concordance lines, while WordList processes whole texts.

How they are computed…

Suppose your text begins like this:

Once upon a time, there was a beautiful princess. She snored. But the prince didn't.

If you've chosen 2-word clusters, the text will be split up as follows:

Once upon

upon a

a time

(note not "time there" because of the comma)

there was (etc.)

With a three-word cluster setting, it would store

Once upon a

upon a time

there was a

was a beautiful

a beautiful princess

But the prince

the prince didn't

(etc.)

That is, each n-word cluster is stored only if a full run of n words occurs before a punctuation boundary is reached; boundaries are marked by ;,.!? (It seems reasonable to suppose that a cluster does not cross clause boundaries, and these punctuation symbols help mark clause boundaries, but Concord and WordList each have a setting for this, so the choice is yours.)
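
The splitting described above can be sketched in a few lines of Python. This is only an illustration and not WordSmith's own code: the function name clusters, the pattern used to recognise a word, and the fixed set of boundary characters are all assumptions made for the example.

```python
import re

# Assumption: the boundary characters are exactly the five named in the text.
BOUNDARY = re.compile(r"[;,.!?]")
# Assumption: a word is a run of letters, optionally with internal apostrophes.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def clusters(text, n):
    """Return every n-word cluster in text, never crossing a boundary."""
    found = []
    # Cut the text into stretches that contain no boundary punctuation.
    for segment in BOUNDARY.split(text):
        words = WORD.findall(segment)
        # A stretch shorter than n words produces an empty range, so no cluster.
        for i in range(len(words) - n + 1):
            found.append(" ".join(words[i:i + n]))
    return found


text = ("Once upon a time, there was a beautiful princess. "
        "She snored. But the prince didn't.")
two_word = clusters(text, 2)
three_word = clusters(text, 3)
```

With the sample text and a three-word setting, the sketch yields the seven clusters listed above; "She snored" contributes nothing because it is only two words long. Counting how often each cluster recurs across a whole text or set of concordance lines would be a separate step.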