WordSmith Tools Manual

boilerplate text

The point of it

 

The aim here is to find repeated chunks of text, such as can arise

- when someone has inserted a paragraph twice by mistake

- through plagiarism

- through re-writing and editing text

- through copying and pasting.

 

The procedure looks essentially for repeated sentences and headings across a large set of texts.
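
As a rough illustration of the idea only (this page does not describe how WordSmith implements the search), a repeated-sentence check might be sketched in Python as follows. The function name find_repeated_sentences, the sentence-splitting rule and the whitespace normalisation are all assumptions made for this sketch.

```python
import re
from collections import defaultdict

# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def find_repeated_sentences(texts):
    """Map each repeated sentence to the names of the texts it occurs in.

    `texts` is a dict of {name: text}. A name is listed once per occurrence,
    so a sentence found twice in one text lists that name twice.
    """
    seen = defaultdict(list)
    for name, text in texts.items():
        for sentence in _SENTENCE_END.split(text):
            # Collapse runs of whitespace so line breaks do not hide a match.
            key = " ".join(sentence.split())
            if key:
                seen[key].append(name)
    return {s: names for s, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    sample = {
        "a.txt": "I like tea. It is warm. I like tea.",
        "b.txt": "It is warm. Bye.",
    }
    for sentence, names in find_repeated_sentences(sample).items():
        print(sentence, names)
```

A sentence repeated inside one text and a sentence shared between texts are both reported, which matches the two kinds of result described further down this page.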

 

How to do it

 

In the Settings tab marked with the yellow oval below, choose a folder. The folder and all its sub-folders will be searched.

Choose the file-types to search (default *.* ) and a tag span such as 200 characters; mark-up gets ignored in this search. Set the minimum number of hits: the number of repetitions which you're interested in seeing in any text file. Min. length is the minimum length of any repeated chunk.

Include unterminated sentences: includes headings.
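
To make the role of each setting concrete, here is a minimal Python sketch of a comparable folder scan. It is not WordSmith's code: the function scan_folder and its parameter names are hypothetical, and several interpretations are assumptions, namely that the tag span caps the length of a tag that gets ignored, that Min. length is counted in characters, that hits are counted across all files, and that a heading is a chunk with no final punctuation set off by a blank line.

```python
import fnmatch
import os
import re
from collections import defaultdict

# Assumption: chunks end at sentence punctuation or at a blank line.
_CHUNK_END = re.compile(r"(?<=[.!?])\s+|\n\s*\n")


def scan_folder(folder, pattern="*.*", tag_span=200, min_hits=2,
                min_length=20, include_unterminated=False):
    """Return {chunk: [relative file paths]} for chunks seen min_hits+ times."""
    # Assumption: "tag span" is the longest <...> tag that is still ignored.
    tag = re.compile(r"<[^<>]{0,%d}>" % tag_span)
    hits = defaultdict(list)
    for root, _dirs, files in os.walk(folder):  # descends into sub-folders
        for name in sorted(fnmatch.filter(files, pattern)):
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                text = tag.sub(" ", handle.read())  # drop mark-up
            for chunk in _CHUNK_END.split(text):
                chunk = " ".join(chunk.split())  # normalise whitespace
                if len(chunk) < min_length:
                    continue
                terminated = chunk.endswith((".", "!", "?"))
                if terminated or include_unterminated:
                    hits[chunk].append(os.path.relpath(path, folder))
    return {c: paths for c, paths in hits.items() if len(paths) >= min_hits}
```

Calling scan_folder with include_unterminated=True would also report repeated headings, at the cost of more noise from short unpunctuated lines.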

 

Press the start button [boiler_start_button].

 

[Screenshot: boilerplate_text_settings]

You may get results like this:

 

[Screenshot: boilerplate_text_list]

In the first few cases a chunk has been found in a number of different text files. In the highlighted case, a sentence beginning "I have also campaigned" is found twice in the text file A00.txt (British National Corpus A00.xml).

 

Right-click to see the text in question.

 

[Screenshot: boilerplate_text_source_view1]

Press either of the two buttons shown by the red arrow to jump from one highlighted chunk to the next.

 

[Screenshot: boilerplate_text_source_view2]

Notice the word with the red circle. In the top context, we have "diagnosed", and below we get "diagnoses". Perhaps someone edited the text and by mistake left a copy towards the bottom of the text.

 

See also: duplicate file contents, corruption check, duplicate file-names.