The point of it
News texts are often edited slightly and re-published, either the same day, later, or in a sister publication. This happens, for example, if a name or title is incorrect or if details need changing.
The idea is to check whether you have near-duplicate text files, i.e. files with the same or very similar contents. This search compares all the text files with each other regardless of their file-names. If duplicates are found you can move and rename them so that they are left out of further research.
How to do it
Specify the folder to search and the file-type(s). Check the difference settings, then simply press "Search". The search goes through that folder and any sub-folders and reports any duplicates found. Note: it will take time, as it needs to compare each text file with all the others in those folders. The comparison checks the vocabulary used in each pair of texts.
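To give a feel for why the search takes time, here is a minimal sketch of an all-pairs vocabulary comparison. It is not the program's own method: the Jaccard overlap measure, the 0.7 threshold and the function names (vocabulary, vocabulary_overlap, find_duplicate_pairs) are all assumptions made for illustration.

```python
import re
from itertools import combinations


def vocabulary(text):
    """Return the set of lower-cased word types found in a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def vocabulary_overlap(text_a, text_b):
    """Jaccard overlap (assumed measure): shared word types / all word types."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    if not vocab_a and not vocab_b:
        return 1.0  # two empty texts count as identical
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


def find_duplicate_pairs(texts, threshold=0.7):
    """Compare every text with every other, ignoring file-names.

    texts maps a file-name to its contents; threshold is a hypothetical
    stand-in for the program's difference settings.
    """
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        score = vocabulary_overlap(text_a, text_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs


texts = {
    "a.txt": "The minister John Smith resigned on Monday",
    "b.txt": "The minister Jon Smith resigned on Monday",
    "c.txt": "Weather will be sunny tomorrow",
}
duplicate_pairs = find_duplicate_pairs(texts)
```

With n files there are n(n-1)/2 pairs to check, so 7,000 files would mean 24,496,500 comparisons, which is why a large folder takes a while.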
Here the status bar shows that 92 clusters of duplicates have been found and that the process is now working on text file 3985 (of over 7,000). (A cluster is a text with at least one other that has duplicate contents.) The process will finish in about 16 minutes.
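As a rough illustration of what a cluster is, the sketch below groups pairs of duplicate file-names so that each text ends up with all the others it is linked to. How the program itself builds its clusters is not stated here; treating them as chains of linked pairs, and the function name build_clusters, are assumptions.

```python
def build_clusters(pairs):
    """Group duplicate pairs into clusters (assumed: linked pairs share a cluster)."""
    parent = {}

    def find(name):
        # Follow parent links to the cluster representative, shortening the path.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for name_a, name_b in pairs:
        root_a, root_b = find(name_a), find(name_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical file-names, for illustration only.
clusters = build_clusters([
    ("dec14.txt", "dec16.txt"),
    ("dec16.txt", "dec16_sister.txt"),
    ("x.txt", "y.txt"),
])
```

Here clusters holds two groups: one of three texts and one of two.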
Results
After sorting by pressing Matches, the top duplicate clusters are these. The first has a text originally published on 14 December and republished on 16 December.
After right-clicking and choosing Text Similarity, we see this:
The texts differ in formatting and minor edits.
Looking at the texts with only 4 out of 10 matches, we again see that they vary slightly in format, date and file-size.
Save, Move, Show
When the search finishes you can press Move to filter the duplicates out into a sub-folder for moved files.
For example, J:\mycorpus\study3\section6\table3.txt, if it's to be moved, will be placed in J:\mycorpus\study3\section6\moved-contents and saved there as table3.moved_contents.
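The naming pattern in that example can be sketched as follows. This only reproduces the single example given above; the function name moved_path is hypothetical, and the program may treat other file-types or name clashes differently.

```python
from pathlib import PureWindowsPath


def moved_path(original, folder="moved-contents", suffix=".moved_contents"):
    """Work out where a duplicate would be placed, following the example above.

    The folder and suffix defaults are taken from that one example.
    """
    path = PureWindowsPath(original)
    # Same parent folder, plus the sub-folder, with the extension swapped.
    return path.parent / folder / (path.stem + suffix)


destination = str(moved_path(r"J:\mycorpus\study3\section6\table3.txt"))
```

PureWindowsPath only manipulates the path text, so the sketch runs on any system and does not touch any files.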
Use the Restore any moved button to go through the whole set, putting any moved files back where they belong. (You may have to close the Corpus Checker first.)
How does it check duplication?
See also: duplicate file-names, text similarity, count files.