The point of it
News texts are often edited slightly and re-published, either the same day, later, or in a sister publication, for example to correct a name or title or to change some details.
The idea is to be able to check whether you have virtually duplicate text files, i.e. with the same or similar contents. This search compares all text files with each other regardless of their file-names. If duplicates are found you can move and rename them so they can be avoided in further research.
How to do it
Specify your Folder to search and file-type(s). Check the difference settings, then simply press "Search". Search will go through that folder and any sub-folders and will report any duplicates found. Note: it will take time, as it needs to compare each text file with all the others in those folders. The comparison checks the vocabulary used in each pair of texts.

Here the status bar shows that 92 clusters of duplicates have been found and the process is now working on text file 3985 of over 7,000. (A cluster is a text with at least one other that has duplicate contents.) The process will finish in about 16 minutes.
Settings
An intrinsically slow process can be speeded up by reducing the number of text files which get compared with each other.
Max length diff %: how much difference in length will you accept in a comparison? The bigger the length difference you allow, the slower the process. In the case above a setting of 15.0% was used, so a text of exactly 100Kb in size would only get compared with others between 85 and 115Kb in size.
Min. match-strings: the program works by seeking matching strings in the texts. This setting records how many (out of 10) must be found. 5 is a reasonable number. The chance of a long text string (starting x characters into one text and exactly y characters long, including punctuation) being found in another text is low.
Check dates: if this is selected, you can decide how many days apart text files can be. (It makes a huge difference to speed. You will be asked to set this if the number of days to check is large. A reasonable number of days for news text is 15 as typically a publisher republishes a story with some changes within 2 weeks.)
Same folder: if this is checked, only text files in the same folder get compared.
Pause the process? Well, you can stop it (press the red exclamation mark).
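The settings above work as a cheap pre-filter that decides whether a pair of text files is worth comparing at all. The following is only a minimal Python sketch of that idea, not the program's own code: the function name and parameters are hypothetical, and it assumes the length difference is measured as a percentage of the first file's size, as in the 100Kb example.

```python
from datetime import datetime, timedelta

def worth_comparing(size1, size2, date1, date2, folder1, folder2,
                    max_length_diff_pct=15.0, max_days=None, same_folder=False):
    """Hypothetical pre-filter: True if the pair should be compared."""
    # Max length diff %: size2 must lie within +/- the percentage of size1.
    if abs(size1 - size2) > size1 * max_length_diff_pct / 100.0:
        return False
    # Check dates (optional): file dates must be at most max_days apart.
    if max_days is not None and abs(date1 - date2) > timedelta(days=max_days):
        return False
    # Same folder (optional): both files must sit in the same folder.
    if same_folder and folder1 != folder2:
        return False
    return True

d14 = datetime(2020, 12, 14)  # illustrative dates only
d16 = datetime(2020, 12, 16)
print(worth_comparing(100_000, 85_000, d14, d16, "a", "b", max_days=15))
print(worth_comparing(100_000, 116_000, d14, d16, "a", "b", max_days=15))
```

The first call accepts the pair (85Kb is within 15% of 100Kb and the dates are two days apart); the second rejects it on length alone, so no text comparison would be needed.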
Results
After sorting by pressing Matches, the top duplicate clusters are these. The first has a text originally published on 14 December and republished on 16 December.

After right-clicking and choosing Text Similarity, we see this:

The texts differ in formatting and minor edits.

Looking at the texts with only 4 out of 10 matches, we again see they vary slightly in format and date and file-size.

Save, Move, Show
When the search finishes you can press Move to filter duplicates to a moved sub-folder.
For example, J:\mycorpus\study3\section6\table3.txt, if it's to be moved, will be placed in J:\mycorpus\study3\section6\moved-contents and saved there as table3.moved_contents.
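The renaming in this example can be sketched in Python as follows. This is an illustration only: the rule (a moved-contents sub-folder next to the original, with the file extension replaced by .moved_contents) is inferred from the single example above, and the function name is hypothetical.

```python
from pathlib import PureWindowsPath

def moved_path(original):
    """Hypothetical helper: where a duplicate would be placed when moved."""
    p = PureWindowsPath(original)
    # Same folder, plus a "moved-contents" sub-folder; the extension is
    # replaced by ".moved_contents" (assumption based on the example above).
    return p.parent / "moved-contents" / (p.stem + ".moved_contents")

print(moved_path(r"J:\mycorpus\study3\section6\table3.txt"))
# prints J:\mycorpus\study3\section6\moved-contents\table3.moved_contents
```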
Use the Restore any moved button to go through the whole set, putting any moved files back where they belong. (You may have to close the Corpus Checker first.)
How does it check duplication?
The assumption underlying this procedure is that a re-edited text contains many chunks of matching strings of characters. No attempt is made to match word meanings.
The process goes through all text files in the chosen folder plus any sub-folders of it, and compares each one with all the remaining text files. Two text files are compared only if the difference in their sizes is within the max. length difference (and they meet the optional date and folder settings). A set of ten strings of 50 characters taken from different parts of text 1 is checked to see how many of the set are found in text 2. If enough of these 10 strings are found, the pair are identified as duplicates.
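The string check described above might look roughly like this in Python. It is a sketch under stated assumptions, not the program's actual code: the function names are hypothetical, and the ten strings are taken at evenly spaced positions, whereas the text above only says they come from different parts of text 1.

```python
def sample_strings(text, count=10, length=50):
    """Take `count` strings of `length` characters spread through `text`."""
    if len(text) < length:
        # Text too short to sample properly: use it whole (or nothing).
        return [text] if text else []
    last_start = len(text) - length
    # Evenly spaced start positions from the beginning to the end (assumption).
    starts = [last_start * i // max(count - 1, 1) for i in range(count)]
    return [text[s:s + length] for s in starts]

def looks_duplicate(text1, text2, min_matches=5):
    """True if at least `min_matches` of the sampled strings occur in text2."""
    samples = sample_strings(text1)
    found = sum(1 for s in samples if s in text2)
    return found >= min_matches

original = " ".join(f"word{i}" for i in range(500))
re_edited = original.replace("word250", "WORD250")   # one small edit
unrelated = " ".join(f"other{i}" for i in range(500))

print(looks_duplicate(original, re_edited))
print(looks_duplicate(original, unrelated))
```

A single small edit can only spoil the sampled strings that overlap it, so the re-edited text still matches on most of the ten, while the unrelated text matches on none.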
This procedure is fairly thorough but will take time because so many text files get compared. A time estimate is shown.
The date comparison uses file dates (which you can set appropriately for text files).
To highlight similarities, the procedure is to take each line of text in one of a pair (unless it's angle-bracketed) and see whether it is also found in the other. If so, it'll get coloured yellow.
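As a rough Python illustration of this highlighting step (not the program's own code), the function below marks each line of one text according to whether it also occurs in the other. It assumes that a match means the whole line is identical after trimming surrounding spaces, and that "angle-bracketed" means a line that starts with < and ends with >; the function name is hypothetical.

```python
def shared_lines(text1, text2):
    """Return (line, is_shared) pairs for every line of text1."""
    other = {line.strip() for line in text2.splitlines()}
    result = []
    for line in text1.splitlines():
        stripped = line.strip()
        # Skip markup lines such as <title>...</title> (assumed meaning
        # of "angle-bracketed") and empty lines.
        is_markup = stripped.startswith("<") and stripped.endswith(">")
        shared = bool(stripped) and not is_markup and stripped in other
        result.append((line, shared))
    return result

first = "<title>Mayor resigns</title>\nThe mayor resigned on Monday.\nHe was 52."
second = "<title>Mayor resigns</title>\nThe mayor resigned on Monday.\nHe was 53."

for line, shared in shared_lines(first, second):
    print("YELLOW" if shared else "      ", line)
```

Only the second line would be coloured here: the first is angle-bracketed and the third differs between the two texts.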
See also: duplicate file-names, text similarity, count files.