Folders in this program
When you run the program, it creates some sub-folders.
These include:
FILTERS | used for storing any filtered articles
SUB-CORPORA | where sub-corpora you create get stored
HEADLINES | as the name suggests, where headline data gets stored
AUTHOR_PUBL_DATA | where information about authors and publications gets stored
TEMPORARY | for stuff you don't want the program to peruse
The program will avoid these folders and their sub-folders when scanning for data -- but it will search the other folders, which are created according to your search-terms.
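The scanning rule above can be sketched as follows. This is only an illustration, not the program's own code: the function name `data_files` is hypothetical, and matching the reserved folder names exactly (case-sensitively) is an assumption.

```python
import os

# Folder names taken from the list above. Matching them by exact,
# case-sensitive name is an assumption of this sketch.
RESERVED = {"FILTERS", "SUB-CORPORA", "HEADLINES", "AUTHOR_PUBL_DATA", "TEMPORARY"}

def data_files(root):
    """Yield every file path under root, skipping reserved folders and their sub-folders."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in RESERVED]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Because the reserved names are pruned before the walk descends, anything placed in TEMPORARY (or below it) is never visited.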
Headlines are identified by looking for the end of the first sentence in the text (excluding all header fields), and then taking either the whole of that sentence or the text up to a line-break, whichever comes first.
In this example, the source text has
Europa, allein zu Haus;
Weder die sowjetische Bedrohung noch die amerikanische Partnerschaft sind heutzutage sinnstiftend. Die EU muss sich aus eigener Kraft Zusammenhalt geben und Richtung. Zweifel ist angebracht
but the headline is interpreted as only the first line
<HEADLINE> Europa, allein zu Haus; </HEADLINE>
because of the line-break. Sentences are identified simply by the presence of a potential sentence-ender such as !, ? or . followed by a capital letter.
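A minimal sketch of the headline rule described above, for illustration only. The function names are hypothetical, and two points are assumptions: that the header fields have already been removed from the text passed in, and that white space may sit between the sentence-ender and the following capital letter.

```python
def first_sentence_end(text):
    """Return the index just past the first sentence-ender that is followed by a capital letter."""
    for i, ch in enumerate(text):
        if ch in ".!?":
            # Assumption: white space may separate the ender from the capital.
            rest = text[i + 1:].lstrip()
            if rest and rest[0].isupper():
                return i + 1
    return len(text)  # no sentence end found: treat the whole text as one sentence

def headline(body):
    """Take the first sentence, or the text up to the first line-break, whichever comes first."""
    body = body.lstrip()
    sentence_end = first_sentence_end(body)
    line_break = body.find("\n")
    if line_break == -1:
        line_break = len(body)
    return body[:min(sentence_end, line_break)].strip()

text = ("Europa, allein zu Haus;\n"
        "Weder die sowjetische Bedrohung noch die amerikanische Partnerschaft "
        "sind heutzutage sinnstiftend. Die EU muss sich aus eigener Kraft "
        "Zusammenhalt geben und Richtung. Zweifel ist angebracht")
print(headline(text))  # prints: Europa, allein zu Haus;
```

Here the first sentence ends after "sinnstiftend.", but the line-break comes earlier, so only the first line is kept.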
output_folder\LANGUAGE\SW\YYYY\MM\
where
LANGUAGE = 2-letter country code such as UK, US, FR, DE, BR, IN
SW = 2-letter search-word code such as cc (climate change), gw (global warming), ge (greenhouse effect)
YYYY = year in 4 digits
MM = month in 2 digits
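The folder pattern above could be assembled like this. It is a sketch under assumptions: the function name `output_dir` is hypothetical, and the first level is taken to be the 2-letter country code, as the legend states.

```python
from pathlib import PureWindowsPath

def output_dir(output_folder, country, search_word, year, month):
    """Build output_folder\\<country>\\<search-word>\\YYYY\\MM as a Windows-style path."""
    return (PureWindowsPath(output_folder) / country / search_word
            / f"{year:04d}" / f"{month:02d}")

print(output_dir("output_folder", "UK", "cc", 2009, 10))
# prints: output_folder\UK\cc\2009\10
```

`PureWindowsPath` is used so that the backslash separators come out the same on any operating system.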
PPPPP_AAAAA_DD_MM_YYYY_UU_SW_ZZZZZZZZ.TXT where P = publication, A = author, D=day, M=month, Y=year,
Example: 00001_00004_06_10_2009_41_CC_6C816C38.TXT
publication code 1, author code 4, date 6th October 2009
41 = there exist exactly four copies of this text (same day, author, same headers and text etc) and this one is the first noticed
CC = "climate change" search-word code
checksum 6C816C38
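A file name of this form can be taken apart as sketched below. The names `FILENAME` and `parse_name` are hypothetical, and the sketch rests on assumptions drawn only from the example: that the publication and author codes are digits, that the two UU digits are the number of copies followed by the number of this copy, and that the checksum is 8 hexadecimal digits.

```python
import re
from datetime import date

# Pattern inferred from the single example above; see the stated assumptions.
FILENAME = re.compile(
    r"(?P<publication>\d{5})_(?P<author>\d{5})_"
    r"(?P<day>\d{2})_(?P<month>\d{2})_(?P<year>\d{4})_"
    r"(?P<copies>\d)(?P<copy_no>\d)_"      # UU: copies, then which copy (assumed)
    r"(?P<search_word>[A-Za-z]{2})_"
    r"(?P<checksum>[0-9A-Fa-f]{8})\.TXT",
    re.IGNORECASE,
)

def parse_name(filename):
    """Split a file name into its fields, or raise ValueError if it does not fit the pattern."""
    m = FILENAME.fullmatch(filename)
    if m is None:
        raise ValueError(f"unrecognised file name: {filename}")
    return {
        "publication": int(m["publication"]),
        "author": int(m["author"]),
        "date": date(int(m["year"]), int(m["month"]), int(m["day"])),
        "copies": int(m["copies"]),
        "copy_no": int(m["copy_no"]),
        "search_word": m["search_word"].upper(),
        "checksum": m["checksum"].upper(),
    }

info = parse_name("00001_00004_06_10_2009_41_CC_6C816C38.TXT")
print(info["date"], info["copies"], info["copy_no"])  # prints: 2009-10-06 4 1
```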
Text by Mike Scott, Help system by Help&Manual