WordSmith Tools Manual

Find duplicates by examining file contents

The point of it

The idea is to be able to check whether you have near-duplicate text files, i.e. files with the same or similar contents. This search compares every text file with every other, regardless of their file-names.


How to do it

Specify the folder to search and the file-type(s), choose a percentage of difference, then press "Search". The search goes through that folder and any sub-folders and reports any duplicates found. Note: this takes time, because each text file has to be compared with all the others in those folders. The comparison checks the vocabulary used in each pair of texts.
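To give a feel for why the search takes time, here is a minimal Python sketch of the general idea: collect the files in a folder and its sub-folders, then visit every pair once. It is an illustration only, not WordSmith's own code; the function names and the default ".txt" extension are assumptions made for the example.

```python
from itertools import combinations
from pathlib import Path


def list_text_files(folder, extensions=(".txt",)):
    """Return the files under folder (sub-folders included) with a wanted extension."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


def pair_count(n):
    """Number of distinct pairs among n files: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def all_pairs(files):
    """Yield each unordered pair of files exactly once."""
    return combinations(files, 2)
```

The number of pairs grows much faster than the number of files: 100 files give 4,950 comparisons, and 1,000 files give 499,500.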





The highlighted line shows two text files which are within 4% of each other in date and size, and whose contents matched in word-type and word-token counts within 4%. Both come from the same paper, dated 1st August 2011, and there are 29 token differences between them.
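A comparison of this kind can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the program's actual method: the tokeniser (runs of letters, digits and underscores, lower-cased), the choice to express a difference as a percentage of the larger count, and all function names are made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def percent_difference(a, b):
    """Difference between two counts as a percentage of the larger one."""
    larger = max(a, b)
    return 0.0 if larger == 0 else 100.0 * abs(a - b) / larger


def vocabulary_profile(text):
    """Return (number of word types, number of word tokens)."""
    tokens = tokenize(text)
    return len(set(tokens)), len(tokens)


def token_differences(text_a, text_b):
    """Count tokens that occur more often in one text than in the other."""
    freq_a, freq_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    return sum(((freq_a - freq_b) + (freq_b - freq_a)).values())


def looks_duplicate(text_a, text_b, max_percent=4.0):
    """True if type and token counts both differ by no more than max_percent."""
    types_a, tokens_a = vocabulary_profile(text_a)
    types_b, tokens_b = vocabulary_profile(text_b)
    return (percent_difference(types_a, types_b) <= max_percent
            and percent_difference(tokens_a, tokens_b) <= max_percent)
```

Matching counts alone do not prove that two texts are the same, which is why a figure such as the number of token differences is worth inspecting as well.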

At the right you can see a keep column, where by default one of the two text files is marked for deletion (see Remove below).


Right-clicking and choosing Show reveals minor differences in any two or more texts:



[Screenshot: the similar texts, not highlighted]

Right-clicking the latter window and choosing Highlight similarities gave this:



Angle-bracketed lines are not compared. In the rest of the text, identical lines are coloured yellow, and here most of the text is yellow. What has happened is that an editor has changed "New Scotland Yard" to "its London HQ". It looks as if minor edits were made but otherwise texts 3 and 5 are duplicates.
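The highlighting rule described here can be imitated in a few lines of Python. This is only a sketch of the idea: it assumes that an "angle-bracketed line" is one which starts with "<" and ends with ">", and that a line counts as identical if the same line appears anywhere in the other text. The function names and status labels are invented for the example.

```python
def is_markup_line(line):
    """Treat a line wrapped in angle brackets as mark-up (an assumption)."""
    stripped = line.strip()
    return stripped.startswith("<") and stripped.endswith(">")


def mark_shared_lines(text_a, text_b):
    """Label each line of text_a as 'skipped', 'same' or 'different'."""
    lines_b = {
        line.strip() for line in text_b.splitlines()
        if not is_markup_line(line)
    }
    result = []
    for line in text_a.splitlines():
        if is_markup_line(line):
            status = "skipped"      # mark-up lines are not compared
        elif line.strip() in lines_b:
            status = "same"         # would be coloured yellow
        else:
            status = "different"    # an edited line, left uncoloured
        result.append((line, status))
    return result
```

With two short invented texts which differ only in their header line and in the wording "New Scotland Yard" versus "its London HQ", the header would be labelled "skipped", the edited sentence "different", and every other line "same".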


Copy, Remove, Show, Keep

Right-clicking gives these options.

To view a file, choose Show in the right-click menu, or double-click to open it in Notepad. Copy produces a list of the results which you can paste elsewhere. Remove moves all the duplicates which are marked for deletion to a sub-folder of the folder they were in, called "moved". To change which duplicate is kept, select the one you want to keep and press Keep.
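If you want to do something similar to Remove in a script of your own, the following Python sketch moves a list of files into a "moved" sub-folder of wherever each file currently is. It only imitates the behaviour described above; the function name is hypothetical and nothing here is taken from WordSmith itself.

```python
import shutil
from pathlib import Path


def move_marked_files(paths, subfolder="moved"):
    """Move each file into a sub-folder of its own folder; return the new paths."""
    moved = []
    for path in map(Path, paths):
        target_dir = path.parent / subfolder
        target_dir.mkdir(exist_ok=True)   # create "moved" on first use
        target = target_dir / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Nothing is deleted, so a file moved by mistake can simply be moved back. The sketch makes no attempt to handle a file of the same name already being present in the sub-folder.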


How does it check duplication?


See also: duplicate file-names