
WordSmith Tools Help

The point of it

 

The aim here is to find repeated chunks, such as can be caused

o if someone has inserted a paragraph twice by mistake,

o by plagiarism,

o by re-writing and editing text,

o in copying and pasting,

o by quoting,

o or by standard chunks of text used as jargon or for convenience.

 

The procedure looks essentially for repeated sentences and headings in a whole lot of texts.

 

How to do it

 

Press the Start button to choose a folder. It will be searched, as will all its sub-folders, except any called filtered or moved.

Choose the file-types to search (default *.* ) and a tag span such as 200 characters, since mark-up gets ignored in this search. Set the minimum number of hits overall and the minimum number of hits there must be per text file. Min. length is the minimum length of any repeated chunk.

Include unterminated sentences: select this to include headings.

 

Press the Start button. Here we should get any repeated chunks over 25 characters which come at least 15 times overall, at least once in each text file.
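The effect of these settings can be pictured with a small Python sketch. This is an illustration only, not WordSmith's own code: the function names (`strip_tags`, `split_chunks`, `find_boilerplate`), the regular expressions and the treatment of Min. length as "at least this many characters" are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def strip_tags(text, tag_span=200):
    """Blank out anything that looks like <mark-up> up to tag_span characters long."""
    return re.sub(r"<[^<>]{0,%d}>" % tag_span, " ", text)

def split_chunks(text, include_unterminated=True):
    """Split text into sentences; lines with no final . ! or ? count as headings."""
    chunks = []
    for line in text.splitlines():
        for part in re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", line.strip()):
            part = " ".join(part.split())  # normalise white space
            if not part:
                continue
            terminated = part[-1] in ".!?"
            if terminated or include_unterminated:
                chunks.append(part)
    return chunks

def find_boilerplate(texts, min_length=25, min_hits=15, min_per_file=1,
                     tag_span=200, include_unterminated=True):
    """texts maps a file-name to its contents.
    Returns {chunk: (total hits, sorted list of file-names)}."""
    total = Counter()
    files = defaultdict(set)
    for name, text in texts.items():
        clean = strip_tags(text, tag_span)
        per_file = Counter(
            c for c in split_chunks(clean, include_unterminated)
            if len(c) >= min_length
        )
        for chunk, n in per_file.items():
            if n >= min_per_file:  # only count files with enough hits
                total[chunk] += n
                files[chunk].add(name)
    return {c: (n, sorted(files[c])) for c, n in total.items() if n >= min_hits}

# Tiny made-up corpus, with the overall minimum lowered to 3 hits.
texts = {
    "a.txt": "Storm news today.\n<p>All rights reserved by the publisher.</p>",
    "b.txt": "Heat record broken.\nAll rights reserved by the publisher.",
    "c.txt": "All rights reserved by the publisher.\nShort.",
}
print(find_boilerplate(texts, min_hits=3))
```

With this made-up corpus only the rights notice survives, because the other sentences are too short and are not repeated.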

 

boilerplate_searching

The program has examined over 7,200 text files (out of about 9,000) and has found a lot of chunks. Most are not repeated as we set the minimum per file to 1.

At the end of the search (it took about a minute), the program found about 330,000 repeated chunks, reduced by these settings to just under 23 entries.

 

Most are like this:

boilerplates_Found

These are messages to the reader about legal responsibilities or offering choices to readers, not directly concerned with the main text content.

 

Some, like cluster 21, are more concerned with text content:

 

boiler_cluster_21_weather

A concordance shows this:

boiler_cluster_21_weather_concordanced

Very clearly this boilerplate chunk is a quote from an expert during a heatwave.

Focus on chunks or files?

The screen-shots above show a focus on Chunks.

 

If you change the focus to Files, you get the list ordered according to which files contain the most different boilerplate chunks. (Press the Freq. header until the highest values are at the top.)
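The difference between the two views can be sketched in a few lines of Python. This is a rough illustration under assumptions, not the program's own method: the function name `files_by_chunk_count` and the input format (each chunk mapped to the file-names containing it) are invented for the example.

```python
from collections import defaultdict

def files_by_chunk_count(chunk_to_files):
    """Turn a chunk-oriented listing into a file-oriented one,
    with the file holding the most different chunks first."""
    per_file = defaultdict(set)
    for chunk, names in chunk_to_files.items():
        for name in names:
            per_file[name].add(chunk)
    # Sort by number of different chunks (descending), then by file-name.
    return sorted(
        ((name, len(chunks)) for name, chunks in per_file.items()),
        key=lambda item: (-item[1], item[0]),
    )

# Made-up chunk listing: three chunks spread over three files.
listing = {
    "chunk A": ["x.txt", "y.txt"],
    "chunk B": ["x.txt"],
    "chunk C": ["x.txt", "z.txt"],
}
for name, count in files_by_chunk_count(listing):
    print(name, count)
```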

 

boiler_file_oriented

The list looks more specific and topic-focussed because the text file-names in this study are very informative. The top few seem to relate a lot to climate change and weather.

 

Further Options

At the top are various buttons offering options:

Show this text

which for the Nebraska text showed this:

 

Nebraska climate text

Highlighting all (right-click) got this:

 

Nebraska climate text highlighted

showing (for this text) that all 6 cases of 5 boilerplate strings come right at the end of the text once the main text message is finished.

You can click the marks in the plot to see each of the references to the same piece of boilerplate. Or go through them with the < and > arrows.

 

Extreme boilerplate

 

Some texts contain huge amounts of repetition. This example is from "live" texts where journalists keep editing and adding to a story as it builds. In this case it was a story about temperature records and many strings got repeated. It is a bit like the difference between a photo and a movie, where the same image, slightly varied, gets repeated.

 

guardian_weather_text_highlighted_all

 

 

Compute Concordance

This passes the set of file-names to Concord and the chunk of boilerplate text, for concordancing.
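As a rough picture of what concordancing a chunk involves, here is a minimal Python sketch. It is not how Concord works internally: the function name `concordance`, the `width` parameter and the made-up text are assumptions for illustration only.

```python
def concordance(texts, chunk, width=30):
    """Return (file-name, left context, chunk, right context) for every hit.
    texts maps a file-name to its contents."""
    lines = []
    for name, text in texts.items():
        flat = " ".join(text.split())  # collapse line breaks and extra spaces
        start = flat.find(chunk)
        while start != -1:
            end = start + len(chunk)
            left = flat[max(0, start - width):start]
            right = flat[end:end + width]
            lines.append((name, left, chunk, right))
            start = flat.find(chunk, start + 1)
    return lines

# Made-up example text.
sample = {"a.txt": "Intro text. Very hot today. Outro."}
for name, left, hit, right in concordance(sample, "Very hot today.", width=6):
    print(f"{name}: {left}[{hit}]{right}")
```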

Save and Load

These save or retrieve a saved listing.

 

Save listing

This lets you save the listing, either to the clipboard from which you can paste it wherever you like, or to an Excel spreadsheet:

Excel_of_boilerplate
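If you prefer to build a spreadsheet-friendly file yourself from a listing, a comma-separated file is one option that Excel can open. The sketch below is only an assumption-laden illustration: the function name `write_listing`, the column names and the row format are hypothetical and do not reflect the layout WordSmith itself saves.

```python
import csv
import io

def write_listing(rows, out):
    """Write (chunk, hits, file-names) rows as CSV to an open text file."""
    writer = csv.writer(out)
    writer.writerow(["chunk", "hits", "files"])  # hypothetical column names
    for chunk, hits, files in rows:
        writer.writerow([chunk, hits, len(files)])

# Made-up row, written to an in-memory buffer for demonstration.
buffer = io.StringIO()
write_listing(
    [("All rights reserved by the publisher.", 3, ["a.txt", "b.txt", "c.txt"])],
    buffer,
)
print(buffer.getvalue())
```

To write a real file, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` instead of the buffer.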

 

See also: duplicate file contents, corruption check, duplicate file-names.

  
