The purpose is to check whether one or more of the text files in your corpus does not belong there. This could be because
• it has become corrupted, so that what used to be good text is now just random characters, or it has been cut much shorter because of disk problems
• it is not even in the same language as the rest of the corpus
The tool works in any language. It takes a known sample of good text (in whatever language) and compares that sample with every text file in your corpus.
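The general idea of comparing a good sample against each file can be illustrated with a minimal Python sketch. This is not the tool's own method, which this page does not describe. It assumes one simple approach: take the most frequent words of the good sample and flag any file in which too few tokens match them. The function names, the `threshold` of 0.2 and the `n` of 100 are hypothetical choices made for illustration.

```python
import re
from collections import Counter


def top_words(text, n=100):
    """Return the set of the n most frequent word forms in text."""
    words = re.findall(r"\w+", text.lower())
    return {w for w, _ in Counter(words).most_common(n)}


def coverage(text, reference_words):
    """Return the share of tokens in text found among reference_words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0  # an empty file matches nothing
    hits = sum(1 for w in words if w in reference_words)
    return hits / len(words)


def flag_suspects(corpus, good_sample, threshold=0.2, n=100):
    """List the names of files whose coverage falls below threshold.

    corpus maps a file name to that file's text (assumed already read in).
    threshold and n are illustrative values, not taken from the tool.
    """
    reference = top_words(good_sample, n)
    return [name for name, text in corpus.items()
            if coverage(text, reference) < threshold]
```

With a good English sample, a file of random characters or a file in another language shares few of the sample's frequent words, so both would be flagged. A file that has merely been cut short would still score well on this measure, so detecting that case would need an additional check, such as a comparison of file lengths.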
See also: How to do it
Page url: http://www.lexically.net/downloads/version5/HTML/?aimofcorpuscorruptiondetector.htm