similar content

The point of it

News texts are often edited slightly and re-published the same day or later or in a sister publication. For example if a name or title is not correct, or if details need changing.

The idea is to be able to check whether you have virtually duplicate text files, i.e. with the same or similar contents. This search compares all text files with each other regardless of their file-names. If duplicates are found you can move and rename them so they can be avoided in further research.

How to do it

Specify your Folder to search and file-type(s). Check the difference settings, then simply press "Search". Search will go through that folder and any sub-folders and will report any duplicates found. Note: it will take time, as it needs to compare each text file with all the others in those folders. The comparison checks the vocabulary used in each pair of texts.

dupl_contents_search

Here the status bar shows 95 clusters of duplicates have been found and the process is now working on text file 3763 (of over 9,000) text files. (A cluster is a text with at least one other that has duplicate contents.) The orange progress bar is at about 40%. The process will finish in about 33 minutes.

Settings

An intrinsically slow process can be speeded up by reducing the number of text files which get compared with each other.

Max length diff = how much difference in length will you accept in a comparison? The bigger any length differences the slower the process. In the case above a setting of 15.0% was used, so any a text of exactly 100Kb in size would only get compared with others between 85 and 115Kb in size.

Max words diff = how much difference in numbers of types or tokens will you allow and still consider the texts duplicates?

Check dates: if this is selected, you can decide how many days apart text files can be.

Same folder: if this is checked, only text files in the same folder get compared. In the study above that meant only texts in the same newspaper got compared.

Pause the process? Well, you can stop it (press the red exclamation mark).

Results

duplicate_contents_search

Cluster 2 shows two text files with 569/561 types published on the same day.

Right-clicking and choosing Show reveals minor differences in any two or more texts:

similar_texts_topA

similar_texts_topB not highlighted

Right-clicking the latter window and choosing Highlight similarities gave this:

similar_texts_topB

Angle-bracketed lines are not compared. In the rest of the text, identical lines get coloured yellow. Most of the rest of the text is coloured yellow. Here what has happened is that an editor has changed New Scotland Yard to its London HQ. It looks as if minor edits were performed but otherwise text 3 and 5 are duplicates.

Copy, Move, Show, Keep

When the search finishes you can press Move to filter duplicates to a moved sub-folder.

For example, J:\mycorpus\study3\section6\table3.txt, if it's to be moved, will be placed in J:\mycorpus\study3\section6\moved-contents and saved there as table3.moved_contents.

Use the Restore any moved button to go through all the set, putting any moved back where they belong. (You may have to close the Corpus Checker first.)

Right-clicking the display of duplicate files gives these options:

duplicate_contents_right_click_menu

To view, choose Show in the right-click menu. Or double-click to open any file in Notepad. Copy produces a list of the results which you can paste elsewhere. Remove simply moves all the duplicates which are marked for deletion, to a sub-folder of where they were before, called "moved", and changes the file extension to .moved. To change which duplicate is kept, select the one you do want and press Keep.

How does it check duplication?

This Info:	ALT+q
Nav Header:	ALT+n
Page Header:	ALT+h
Topic Header:	ALT+t
Topic Body:	ALT+b
Exit Menu/Up:	ESC

Please enable JavaScript to view this site.

WordSmith Tools Help

Keyboard Navigation