You can specify a document header to be cut out routinely, that is, standard text at the beginning of each file in a corpus. For example, each BNC text file has a long section starting with <teiHeader> and ending with </teiHeader>. Another common case is a copyright notice at the top of each file.

The process looks in each file for the Document header ends mark-up and deletes all text up to that point.
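
As an illustration only, the cut can be sketched in a few lines of Python. This is not the program's own code: the function name strip_header and its header_end parameter are hypothetical, and the sketch assumes that the end mark-up itself is removed along with the header and that a file without the mark-up is left unchanged.

```python
def strip_header(text: str, header_end: str = "</teiHeader>") -> str:
    """Return text with everything up to the first header_end removed.

    Assumption: the end mark-up itself is deleted together with the header.
    """
    pos = text.find(header_end)
    if pos == -1:
        # Assumption: with no end mark-up present, the text is kept as it is.
        return text
    return text[pos + len(header_end):]


sample = "<teiHeader>catalogue details</teiHeader><text>Body starts here.</text>"
print(strip_header(sample))  # prints: <text>Body starts here.</text>
```

Only the first occurrence of the end mark-up is used, so any later text that happens to contain the same string is kept.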




