Similar content

The point of it

News texts are often edited slightly and re-published the same day, later, or in a sister publication, for example because a name or title was incorrect or because details needed changing.

The idea is to check whether you have near-duplicate text files, i.e. files with the same or similar contents. This search compares all text files with each other regardless of their file names. If duplicates are found, you can move and rename them so that they are left out of further research.

How to do it

Specify the Folder to search and the file type(s). Check the difference settings, then press "Search". The search goes through that folder and any sub-folders and reports any duplicates found. Note: this takes time, because each text file has to be compared with all the others in those folders. The comparison checks the vocabulary used in each pair of texts.
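To make the idea of a pairwise vocabulary comparison concrete, here is a minimal sketch in Python. It is not WordSmith's own procedure, which this page does not describe in detail. It assumes a simple Jaccard overlap of word types, and the names `vocabulary`, `jaccard`, `find_similar_pairs` and the 0.7 threshold are all hypothetical choices made for illustration.

```python
import re
from itertools import combinations

def vocabulary(text):
    """Return the set of lower-cased word types found in a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Share of word types common to both sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def find_similar_pairs(texts, threshold=0.7):
    """Compare every pair of texts; `texts` maps a name to its contents.

    The threshold is an illustrative assumption, not a WordSmith setting.
    """
    vocab = {name: vocabulary(body) for name, body in texts.items()}
    pairs = []
    for (n1, v1), (n2, v2) in combinations(vocab.items(), 2):
        score = jaccard(v1, v2)
        if score >= threshold:
            pairs.append((n1, n2, score))
    return pairs

# Invented sample texts: the first two differ only in one corrected name.
sample = {
    "a.txt": "The minister, John Smith, spoke to reporters on Monday.",
    "b.txt": "The minister, Jon Smith, spoke to reporters on Monday.",
    "c.txt": "Share prices fell sharply after the announcement.",
}
similar_pairs = find_similar_pairs(sample)
```

In this sketch only `a.txt` and `b.txt` are reported as a pair, since they share 8 of their 10 distinct word types. Comparing every file with every other means n(n-1)/2 comparisons, so 7,000 files give roughly 24.5 million pairs, which is why a full search takes time.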

dupl_contents_search

Here the status bar shows that 92 clusters of duplicates have been found and that the process is now working on text file 3985 of over 7,000. (A cluster is a text together with at least one other text that has duplicate contents.) The process is expected to finish in about 16 minutes.

Settings

Results

After sorting by pressing Matches, the top duplicate clusters are these. The first has a text originally published on 14 December and republished on 16 December.

dupl_contents_many_matches

After right-clicking and choosing Text Similarity, we see this:

dupl_contents_in Text Similarity

The texts differ in formatting and minor edits.

dupl_contents_fewer_matches

Looking at the texts with only 4 out of 10 matches, we again see that they vary slightly in format, date and file size.

dupl_contents_in Text Similarity 2

Save, Move, Show

When the search finishes, you can press Move to move the duplicates into a moved-contents sub-folder.

For example, if J:\mycorpus\study3\section6\table3.txt is to be moved, it will be placed in J:\mycorpus\study3\section6\moved-contents and saved there as table3.moved_contents.
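The renaming pattern in that example can be sketched in Python as follows. This only mirrors the single example given above. The function name `moved_destination` is hypothetical, and how files with several dots in their names are handled is an assumption here, not something this page states.

```python
from pathlib import PureWindowsPath

def moved_destination(original):
    """Build the destination path for a file that is to be moved.

    Follows the example on this page: the file goes into a
    "moved-contents" sub-folder of its own folder, and its extension
    is replaced by ".moved_contents".
    """
    src = PureWindowsPath(original)
    return src.parent / "moved-contents" / (src.stem + ".moved_contents")

dest = moved_destination(r"J:\mycorpus\study3\section6\table3.txt")
print(dest)
```

`PureWindowsPath` is used so that the sketch handles Windows-style paths without touching the file system, whatever operating system it is run on.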

Use the Restore any moved button to go through the whole set and put any moved files back where they belong. (You may have to close the Corpus Checker first.)

How does it check duplication?

WordSmith Tools Help