WordSmith Tools Manual

Corpus Checker

Aim

The purpose is to check the text files in your corpus for corruption, relevance and duplication.

Examples of what the check can reveal about a text file:

• it has become corrupted, so that what used to be good text is now just random characters, or it has been cut much shorter because of disk problems

• it isn't even in the same language as the rest of the corpus

• it is a copy of another text file or has very similar text

• it contains too much boilerplate text

• it is or is not relevant to a particular area of enquiry
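
To illustrate the "copy of another text file or has very similar text" case, here is a minimal Python sketch. It is not how the Corpus Checker itself works internally; it assumes one common approach, comparing sets of three-word sequences with Jaccard similarity. The function names (shingles, jaccard, similar_pairs) and the 0.8 threshold are hypothetical choices made for this example.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word sequences (shingles) found in a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Very short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: 1.0 if identical, 0.0 if disjoint."""
    if not a and not b:
        return 1.0  # two empty texts count as identical
    return len(a & b) / len(a | b)


def similar_pairs(texts, threshold=0.8):
    """Yield (name1, name2, score) for each pair scoring at or above threshold.

    texts maps a file name to that file's contents (an assumed layout).
    """
    names = sorted(texts)
    sets = {name: shingles(texts[name]) for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(sets[first], sets[second])
            if score >= threshold:
                yield first, second, score
```

Every pair of files is compared here, which is fine for a small corpus but slow for a large one. A lower threshold reports looser similarity, such as texts that share a lot of boilerplate.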

The tool works in any language. It checks for corruption by taking a known sample of good text (in whatever language) and comparing that sample with every text in your corpus.
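
The idea of comparing a known good sample with each text can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Corpus Checker's actual algorithm: it assumes that character frequencies are a reasonable fingerprint of "good text" and that cosine similarity is a suitable measure. The names char_profile, cosine and flag_suspect, and the 0.8 threshold, are hypothetical.

```python
import math
from collections import Counter


def char_profile(text):
    """Relative frequency of each character in a text (case-insensitive)."""
    counts = Counter(text.lower())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity of two frequency profiles (0.0 if either is empty)."""
    dot = sum(value * q.get(ch, 0.0) for ch, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def flag_suspect(good_sample, texts, threshold=0.8):
    """Return names of texts whose profile differs too much from the sample.

    texts maps a file name to that file's contents (an assumed layout).
    """
    reference = char_profile(good_sample)
    return [
        name
        for name in sorted(texts)
        if cosine(reference, char_profile(texts[name])) < threshold
    ]
```

Because the profile is built from the sample itself, the same sketch applies to any language: a text full of random characters, or one written in a different script, will tend to score low against the sample. A suitable threshold depends on the corpus and would need to be found by trial.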