
WordSmith Tools Manual


boilerplate text


The point of it


The aim here is to find repeated chunks of text, which can arise

- if someone has inserted a paragraph twice by mistake

- by plagiarism

- by re-writing and editing text

- in copying and pasting.


The procedure essentially looks for repeated sentences and headings across a large number of texts.


How to do it


In the Settings tab with the yellow oval below, choose a folder. That folder and all its sub-folders will be searched.

Choose the file types to search (default *.*) and a tag span such as 200 characters, since mark-up is ignored in this search. Set the minimum number of hits, that is, the number of repetitions you are interested in seeing in any text file. Min. length is the minimum length a repeated chunk must have to be reported.

Include unterminated sentences: select this to include headings as well.
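
To make the effect of these settings concrete, here is a minimal Python sketch of the same kind of search. It is not WordSmith's own code or algorithm. The function names and parameters (find_repeats, tag_span, min_hits, min_length, include_unterminated) are hypothetical and simply mirror the settings described above. The sketch assumes that mark-up means angle-bracket tags no longer than the tag span, that min. length is counted in characters, and that hits are counted across all files together.

```python
import re
from collections import defaultdict
from pathlib import Path


def strip_markup(text, tag_span=200):
    """Blank out <...> tags whose contents are at most tag_span characters (assumption)."""
    return re.sub(r"<[^<>]{0,%d}>" % tag_span, " ", text)


def split_chunks(text, include_unterminated=True):
    """Split text into sentences; optionally keep lines with no final . ! or ?"""
    chunks = []
    for line in text.splitlines():
        line = " ".join(line.split())  # normalise whitespace
        # Either a run ending in sentence punctuation, or an unterminated remainder.
        for piece in re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", line):
            piece = piece.strip()
            if not piece:
                continue
            terminated = piece[-1] in ".!?"
            if terminated or include_unterminated:
                chunks.append(piece)
    return chunks


def find_repeats(folder, pattern="*", tag_span=200, min_hits=2,
                 min_length=20, include_unterminated=True):
    """Return {chunk: {relative file path: count}} for chunks seen min_hits times or more."""
    folder = Path(folder)
    seen = defaultdict(lambda: defaultdict(int))
    # rglob searches the folder and all its sub-folders.
    for path in sorted(folder.rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        cleaned = strip_markup(text, tag_span)
        for chunk in split_chunks(cleaned, include_unterminated):
            if len(chunk) >= min_length:
                seen[chunk][path.relative_to(folder).as_posix()] += 1
    return {chunk: dict(files) for chunk, files in seen.items()
            if sum(files.values()) >= min_hits}
```

Calling find_repeats on a folder of texts, for example with pattern="*.txt", returns each repeated chunk together with the files it occurs in and how often. A chunk listed under several files corresponds to the "found in a number of different text files" case, while a count of 2 under a single file corresponds to a sentence repeated within one text.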


Press the start button.



You may get results like this:



In the first few cases a chunk has been found in a number of different text files. In the highlighted case, a sentence beginning "I have also campaigned" is found twice in the text file A00.txt (British National Corpus A00.xml).


Right-click to see the text in question.



Press either of the two buttons shown by the red arrow to jump from one highlighted chunk to the next.



Notice the word with the red circle. In the top context we have "diagnosed", and below we get "diagnoses". Perhaps someone edited the text and by mistake left a copy towards the bottom of the text.
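
Small differences of this kind between two otherwise identical contexts can also be located automatically. The following is a rough Python sketch using the standard difflib module. It is not part of WordSmith; changed_words is a hypothetical helper, and the sentences in the usage note are invented for illustration.

```python
import difflib


def changed_words(context_a, context_b):
    """Return (old, new) word runs that differ between two contexts."""
    a, b = context_a.split(), context_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            differences.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return differences
```

For instance, comparing "the doctor diagnosed the patient early" with "the doctor diagnoses the patient early" yields the single pair ("diagnosed", "diagnoses"), which is the kind of edit that suggests one copy is a revised version of the other.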


See also: duplicate file contents, corruption check, duplicate file-names.