WordSmith Tools Help

Zoom Window Out
Larger Text | Smaller Text
Hide Page Header
Show Expanding Text
Print Topic
Share This Topic
Save Permalink URL

Navigation: Utility Programs > Text Converter > Convert within the text file

multi-word units

Scroll Prev Top Next More

The idea here is to mark up your corpus with clusters or phrases you want treated as single items.

You can do that in 2 ways:

•insert _ so Los Angeles becomes Los_Angeles and New York = New_York (top box)

and/or

•annotate the text so Los Angeles becomes <mwu>Los Angeles</mwu>. (lower box)

text_converter_MWU_window

For either method you'll need a phrase file which contains the items you're interested in. Mine contained just this:

LANY_txt

After processing (using the tag insertion method) my source text looked like this:

LANY_in_corpus

with <mwu> before and </mwu> after each item found.

Whole word search

The search is a whole word search. It will find New York in ... New York's people but not in that's very New Yorkish ....

Case sensitivity

Whether you choose case insensitive or sensitive only determines whether the search succeeds or fails. (If your phrase file has new york then a case insensitive search will also find New York, new York or New york.)

Replacement

Replacements will match the case in the original text file.

Original: New York people ... they were in a new York restaurant ...

Conversion: <mwu>New York</mwu> people ... they were in a <mwu>new York</mwu> restaurant ...

Handling the text now it has been modified

With method 1, you merely need to teach WordSmith that the underscore character is to be accepted as a valid character.

With method 2, you merely have to let WordSmith handle your mark-up to make a word list with clusters in single-word list.

Made a mistake?

Run the procedure again with the same texts and list of multi-word items but choosing remove this markup.

Please enable JavaScript to view this site.

WordSmith Tools Help