Duplicates

The point of it

Duplicates can arise where two files have the same date (day, month and year), the same author and the same publication. This typically happens when the provider gives you all files on 'climate change' and all files on 'global warming' in two separate downloads: the same file will very often come in twice. Alternatively, the contents may be near-duplicates with some very minor alteration in punctuation or a slight change in wording, or it may simply be a coincidence that the same author wrote more than one article that day in the same publication. The program attempts to distinguish these cases by examining the headers, the headlines and the text, and will only overwrite where the contents seem to be identical. (See the Reference section for a case where there is great overlap.)
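To make the distinction concrete, here is a minimal Python sketch of one way such a check might work. It is not the program's actual method: the function names and the dictionary keys ('name', 'date', 'author', 'publication', 'headline', 'text') are hypothetical. It only ignores differences in case, ASCII punctuation and spacing, so a genuine change in wording would not be caught.

```python
import string
from collections import defaultdict

# Translation table that deletes ASCII punctuation (curly quotes etc. are left alone).
_NO_PUNCT = str.maketrans("", "", string.punctuation)


def normalise(text):
    """Lower-case, drop ASCII punctuation and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_NO_PUNCT).split())


def find_identical(articles):
    """Group articles whose date, author, publication, headline and text agree.

    `articles` is a list of dicts with the hypothetical keys
    'name', 'date', 'author', 'publication', 'headline' and 'text'.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for art in articles:
        key = (
            art["date"],
            art["author"],
            art["publication"],
            normalise(art["headline"]),
            normalise(art["text"]),
        )
        groups[key].append(art["name"])
    return [names for names in groups.values() if len(names) > 1]
```

Two articles by the same author on the same day in the same publication end up in different groups here unless their headlines and texts also agree, which mirrors the coincidence case described above.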


How to do it

Press the Find & Mark Corpus Duplicates button. This takes the files from the parsed files folder and finds duplicates, that is, cases where the same story, author, date and publication were found in different folders (e.g. one in a climate change folder and another in a global warming folder). They get marked thus:

00001_00004_06_10_2009_21_CC_05AD269D.TXT

00001_00004_06_10_2009_2__GW_05AD269D.TXT


In this example the 2 means that the same text was found in 2 categories. The 1 means that the CC copy was the first one encountered (all others get _ instead).
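If you need to read these marks in a script of your own, the following Python sketch shows one possible way. It rests on assumptions that go beyond what is stated here: that the mark is a single-digit count followed by either 1 or _, as in the _21_ example, that it sits between five numeric fields and the category code, and that the name ends in an eight-character hexadecimal code. The names NAME and parse_marker are hypothetical.

```python
import re

# Assumed layout: five numeric fields, a two-character mark (count, then
# first-copy flag), a category code and an eight-character hexadecimal code.
NAME = re.compile(
    r"^\d+_\d+_\d+_\d+_\d+_"
    r"(?P<count>\d)(?P<first>[1_])_"
    r"(?P<category>[A-Z]+)_(?P<code>[0-9A-F]{8})\.TXT$",
    re.IGNORECASE,
)


def parse_marker(filename):
    """Return the duplicate mark of a file name, or None if it does not fit."""
    m = NAME.match(filename)
    if m is None:
        return None
    return {
        "categories": int(m.group("count")),  # how many categories hold this text
        "is_first": m.group("first") == "1",  # True for the first copy encountered
        "category": m.group("category").upper(),
        "code": m.group("code").upper(),
    }
```

For 00001_00004_06_10_2009_21_CC_05AD269D.TXT this reports 2 categories and a first copy in category CC. Any name that does not fit the assumed layout simply gives None.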

You may, if you wish, omit these duplicates. That is, you will include only one of the copies, the one marked _21_, and leave out the others.

You can also export lists of your duplicates or non-duplicates, or of duplicates with full details, here. A list of 'all except duplicates' can be used as input to WordSmith for file choosing. 'Duplicates with full details' gives you an idea of which files overlapped so that you can check them. 'Only redundant duplicates' means those files which can safely be deleted because their contents are the same as those of some other file (00001_00004_06_10_2009_2__GW_05AD269D.TXT is an example); 'duplicates redundant or not' means all files which have matches (both files in the example above).
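The relationship between these lists can be sketched in Python as follows. This is only an illustration of the definitions above, not the program's export routine. It assumes that the mark is a single-digit count followed by 1 (first copy) or _ (redundant copy), that 'all except duplicates' keeps the first copy of each duplicated text, and that any name not fitting the pattern is unmarked. The names MARKED and build_lists are hypothetical.

```python
import re

# Assumed mark: one digit (number of categories), then 1 for the first copy
# or _ for a redundant copy, between the numeric fields and the category code.
MARKED = re.compile(
    r"^\d+_\d+_\d+_\d+_\d+_(\d)([1_])_[A-Z]+_[0-9A-F]{8}\.TXT$",
    re.IGNORECASE,
)


def build_lists(filenames):
    """Sort file names into the three lists described in the text."""
    lists = {
        "all_except_duplicates": [],
        "only_redundant_duplicates": [],
        "duplicates_redundant_or_not": [],
    }
    for name in filenames:
        m = MARKED.match(name)
        if m is None:
            # No mark recognised: treat the file as having no matches.
            lists["all_except_duplicates"].append(name)
            continue
        lists["duplicates_redundant_or_not"].append(name)
        if m.group(2) == "1":
            # First copy encountered: this is the one to keep.
            lists["all_except_duplicates"].append(name)
        else:
            lists["only_redundant_duplicates"].append(name)
    return lists
```

Given the two example files, both land in 'duplicates redundant or not', the GW copy alone lands in 'only redundant duplicates', and the CC copy is the one kept in 'all except duplicates'.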


Text by Mike Scott, Help system by Help&Manual
