The idea here is to mark up your corpus so that the clusters or phrases you choose are treated as single items.
You can do that in two ways:
•insert an underscore, so that Los Angeles becomes Los_Angeles and New York becomes New_York (yellow ring)
•annotate the text, so that Los Angeles becomes <mwu>Los Angeles</mwu> (grey ring)
For either method you'll need a phrase file which contains the items you're interested in. Mine contained just this:
After processing (using the tag insertion method) my source text looked like this:
with <mwu> before and </mwu> after each item found.
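If it helps to see the two methods side by side, here is a minimal Python sketch of the same idea. It is not WordSmith's own code: the function name mark_phrases, the phrase list and the sample sentence are all invented for illustration, and the matching here is case sensitive.

```python
import re

def mark_phrases(text, phrases, method="tag"):
    """Mark each listed phrase in text as a single item (illustrative only).

    method="underscore" joins the words with "_";
    method="tag" wraps the phrase in <mwu>...</mwu>.
    """
    # Try longer phrases first so a long phrase wins over a shorter one it contains.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    )

    def replace(match):
        phrase = match.group(0)
        if method == "underscore":
            return phrase.replace(" ", "_")
        return "<mwu>" + phrase + "</mwu>"

    return pattern.sub(replace, text)

# Hypothetical phrase file contents and source text.
phrases = ["Los Angeles", "New York"]
text = "She flew from Los Angeles to New York."

print(mark_phrases(text, phrases, "underscore"))
print(mark_phrases(text, phrases, "tag"))
```

The first call prints the sentence with Los_Angeles and New_York; the second prints it with <mwu> and </mwu> around each phrase.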
Whether you choose a case-insensitive or a case-sensitive search, the replacement will match the case used in your phrase file. If your phrase file contains "a lot", a case-insensitive search will also find "A lot" or "a LOT", and each will be replaced with the form in the phrase file.
Original: A lot of people ...
Conversion: <mwu>a lot</mwu> of people ...
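The case behaviour described above can be sketched in Python as follows. This is only an illustration under my own assumptions, not how WordSmith is implemented: tag_phrases and its ignore_case parameter are hypothetical names, and the sketch assumes the phrase file does not list the same phrase twice with different capitalisation.

```python
import re

def tag_phrases(text, phrases, ignore_case=True):
    """Wrap each phrase in <mwu> tags, writing it as spelled in the phrase list."""
    # Lower-cased form -> spelling used in the (hypothetical) phrase file.
    canonical = {p.lower(): p for p in phrases}
    ordered = sorted(phrases, key=len, reverse=True)
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b", flags
    )

    def replace(match):
        found = match.group(0)
        # Whatever case was found, emit the phrase-file spelling.
        return "<mwu>" + canonical.get(found.lower(), found) + "</mwu>"

    return pattern.sub(replace, text)

print(tag_phrases("A lot of people ...", ["a lot"]))
print(tag_phrases("A lot of people ...", ["a lot"], ignore_case=False))
```

With ignore_case=True the capitalised A lot is found and rewritten in lower case, as in the Original/Conversion pair above. With ignore_case=False the text comes back unchanged, because only an exact a lot would match.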
Handling the text now it has been modified
With method 1 (underscores), you merely need to tell WordSmith to accept the underscore as a valid character within a word.
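To see why that matters, here is a small Python sketch of a word splitter that only joins characters it has been told are valid. It is a stand-in for illustration, not WordSmith's tokeniser or its settings: tokenize and extra_chars are made-up names, and only unaccented letters are handled.

```python
import re

def tokenize(text, extra_chars=""):
    """Split text into words made of letters plus any extra allowed characters."""
    # Letters always count; extra_chars lists further characters accepted inside a word.
    pattern = "[A-Za-z" + re.escape(extra_chars) + "]+"
    return re.findall(pattern, text)

text = "She flew from Los_Angeles to New_York"

print(tokenize(text))       # underscore not accepted: Los and Angeles are separate words
print(tokenize(text, "_"))  # underscore accepted: Los_Angeles is one word
```

Until the underscore is accepted, the joined phrases are simply cut back into their separate words, so the markup has no effect on word lists or concordances.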