Download Parser

Reference

Folders in this program

When you run the program, it creates a number of sub-folders.

These include:

FILTERS	used for storing any filtered articles
SUB-CORPORA	where any sub-corpora you create are stored
HEADLINES	where headline data is stored
AUTHOR_PUBL_DATA	where information about authors and publications is stored
TEMPORARY	for anything you do not want the program to scan

The program skips these folders and their sub-folders when scanning for data, but it does search the other folders, which are created according to your search-terms.
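As an illustration only (this is not the program's own code), a script that reads the output folder could skip the same folders in Python as sketched below. The function name iter_data_files, the case-insensitive name match and the restriction to .txt files are assumptions.

```python
import os

# Folder names taken from the list above; matching them case-insensitively is an assumption.
RESERVED = {"FILTERS", "SUB-CORPORA", "HEADLINES", "AUTHOR_PUBL_DATA", "TEMPORARY"}

def iter_data_files(root):
    """Yield paths of .txt files under root, skipping reserved folders and their sub-folders."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into those folders.
        dirnames[:] = [d for d in dirnames if d.upper() not in RESERVED]
        for name in filenames:
            if name.lower().endswith(".txt"):
                yield os.path.join(dirpath, name)
```

Because the reserved names are pruned before os.walk descends, everything beneath them is skipped as well.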

Output file format

At the top of each text you will find a number of headers, which vary with the text downloaded, followed by the main story. All header information is enclosed in angle-brackets.

In this English example, we see there is a publication date earlier than the "load-date", and 2 copyright statements. There is a short headline.

<305 of 2173 DOCUMENTS THIS DOWNLOAD>

<SOURCE: BRISTOL EVENING POST>

<DATE: APRIL 25, 2008 FRIDAY>

<SECTION: Pg. 17>

<LENGTH: 246 words>

<LOAD-DATE: April 26, 2008>

<LANGUAGE: ENGLISH>

<PUBLICATION-TYPE: Newspaper>

</HEADER>

Changing climate is a challenge

</HEADLINE>

<STORY>

...

</STORY>
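As a minimal sketch (not the program's own code), header lines like those above could be read into a dictionary in Python. The function name parse_headers and the regular expression are assumptions made for illustration. Lines without a "NAME: value" pair, such as the document counter, are ignored.

```python
import re

# Matches a header line such as "<SOURCE: BRISTOL EVENING POST>".
HEADER_LINE = re.compile(r"^<([^<>/:]+):\s*(.*)>$")

def parse_headers(text):
    """Return a dict of the header fields found before the </HEADER> marker."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "</HEADER>":
            break
        match = HEADER_LINE.match(line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields

sample = """<305 of 2173 DOCUMENTS THIS DOWNLOAD>
<SOURCE: BRISTOL EVENING POST>
<DATE: APRIL 25, 2008 FRIDAY>
<LENGTH: 246 words>
<LOAD-DATE: April 26, 2008>
</HEADER>
Changing climate is a challenge
</HEADLINE>"""

headers = parse_headers(sample)
print(headers["SOURCE"])  # BRISTOL EVENING POST
print(headers["LENGTH"])  # 246 words
```

The field names are kept exactly as they appear in the file, so texts in other languages yield keys in those languages.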

In this German example, as in the English one, we get the length of the story in words (663), a publication date, an "update" and various other codes.

<SOURCE: DIE WELT>

<DATE: MONTAG 9. NOVEMBER 2015>

<AUTOR: John Kornblum>

<RUBRIK: FORUM; Gastkommentar; S. 2 Ausg. 261>

<LÄNGE: 663 Wörter>

<UPDATE: 9. November 2015>

<SPRACHE: GERMAN; DEUTSCH>

<GRAFIK: Amin Akhtar>

<PUBLICATION-TYPE: Zeitung>

<ZEITUNGS-CODE: WE>

</HEADER>

Weiter so, Angela Merkel!

</HEADLINE>

<STORY>

Headlines

Headlines are identified by looking for the end of the first sentence in the text (excluding all header fields), and then taking either the whole of that sentence or the text up to the first line-break, whichever comes first.

In this example, the source text has

Europa, allein zu Haus;

Weder die sowjetische Bedrohung noch die amerikanische Partnerschaft sind

heutzutage sinnstiftend. Die EU muss sich aus eigener Kraft Zusammenhalt geben

und Richtung. Zweifel ist angebracht

but the headline is interpreted as only the first line

Europa, allein zu Haus;

</HEADLINE>

because of the line-break. Sentences are identified simply by the presence of potential sentence-enders such as ! ? or . and a following capital letter.
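The rule described above can be sketched in Python as follows. This is an approximation under stated assumptions, not the program's actual implementation: the name extract_headline is hypothetical, only . ! and ? count as sentence-enders, and the capital letter is assumed to follow after some white space.

```python
import re

# A sentence-ender (. ! ?) followed by white space and a capital letter.
# The accented ranges cover Latin-1 capitals such as Ä, Ö, Ü and É.
SENTENCE_END = re.compile(r"[.!?](?=\s+[A-ZÀ-ÖØ-Þ])")

def extract_headline(body):
    """Return the first sentence of body, cut at the first line-break if that comes sooner."""
    body = body.lstrip()
    candidates = []
    newline_at = body.find("\n")
    if newline_at != -1:
        candidates.append(newline_at)
    match = SENTENCE_END.search(body)
    if match:
        candidates.append(match.end())  # keep the sentence-ender itself
    cut = min(candidates) if candidates else len(body)
    return body[:cut].strip()

story = (
    "Europa, allein zu Haus;\n"
    "Weder die sowjetische Bedrohung noch die amerikanische Partnerschaft sind\n"
    "heutzutage sinnstiftend. Die EU muss sich aus eigener Kraft Zusammenhalt geben"
)
print(extract_headline(story))  # Europa, allein zu Haus;
```

Here the line-break comes before the first sentence-ender, so only the first line is returned.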

Folder format

output_folder\LANGUAGE\SW\YYYY\MM\

where

LANGUAGE= 2-letter country code such as UK, US, FR, DE, BR, IN

SW= 2-letter search-word code such as cc (climate change), gw (global warming), ge (greenhouse effect)

YYYY=year in 4 digits

MM=month in 2 digits
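For illustration, the folder for a given story could be built like this in Python. The function name story_folder and the root C:\corpus are hypothetical. PureWindowsPath is used so that the back-slash separators come out the same on any platform.

```python
from datetime import date
from pathlib import PureWindowsPath

def story_folder(output_folder, country, search_code, when):
    """Build output_folder\\<country code>\\<search-word code>\\YYYY\\MM for one story."""
    return (
        PureWindowsPath(output_folder)
        / country
        / search_code
        / f"{when.year:04d}"   # year in 4 digits
        / f"{when.month:02d}"  # month in 2 digits
    )

print(story_folder(r"C:\corpus", "UK", "cc", date(2009, 10, 6)))
# C:\corpus\UK\cc\2009\10
```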

File-name format

PPPPP_AAAAA_DD_MM_YYYY_UU_SW_ZZZZZZZZ.TXT

where

P = publication, A = author, D = day, M = month, Y = year
U = duplicates code (explained in the example below)
SW = search-word code
Z = a checksum of the text of each story

Example:

00001_00004_06_10_2009_41_CC_6C816C38.TXT

publication code 1,

author code 4,

date 6th October 2009

41 = exactly four copies of this text exist (same day, same author, same headers and text, etc.) and this one is the first noticed.

CC= "climate change" search-word code

checksum 6C816C38
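A file name in this format can be split into its parts with a regular expression. The sketch below makes several assumptions: the publication and author codes are taken to be five digits, the duplicates code two characters, and the checksum eight hexadecimal digits, as in the example. The name parse_file_name is hypothetical. The duplicates code is returned as-is rather than interpreted.

```python
import re
from datetime import date

FILE_NAME = re.compile(
    r"^(?P<publication>\d{5})_(?P<author>\d{5})_"
    r"(?P<day>\d{2})_(?P<month>\d{2})_(?P<year>\d{4})_"
    r"(?P<duplicates>[0-9A-Za-z]{2})_(?P<search_code>[A-Za-z]{2})_"
    r"(?P<checksum>[0-9A-Fa-f]{8})\.TXT$",
    re.IGNORECASE,
)

def parse_file_name(name):
    """Split a story file name into publication, author, date, codes and checksum."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name}")
    parts = match.groupdict()
    return {
        "publication": int(parts["publication"]),
        "author": int(parts["author"]),
        "date": date(int(parts["year"]), int(parts["month"]), int(parts["day"])),
        "duplicates": parts["duplicates"],
        "search_code": parts["search_code"],
        "checksum": parts["checksum"],
    }

info = parse_file_name("00001_00004_06_10_2009_41_CC_6C816C38.TXT")
print(info["publication"], info["author"], info["date"], info["duplicates"])
# 1 4 2009-10-06 41
```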

Duplicates

Here is a sample from two "different-but-the-same" texts. The headers show that the lengths are very similar (682 and 686 words), but the texts were published on different dates and in publications with somewhat differing names. The program compares the header information of each text, so it determined that they differed. However, reading the two suggests that they are slightly edited variants of the same story.

ONE

<942 of 1090 DOCUMENTS>

<SOURCE: Europolitique>

<DATE: 4 septembre 2008>

<LONGUEUR: 682 mots>

<DATE-CHARGEMENT: 3 septembre 2008>

<LANGUE: FRENCH; FRANÇAIS>

<TYPE-PUBLICATION: Journal>

</HEADER>

RECHERCHE : LA COMMISSION VEUT INTÉGRER RECHERCHE MARINE ET MARITIME .

RUBRIQUE: No. 3588

Des activités de recherche bien ciblées, cohérentes et intégrées peuvent

contribuer à assurer la croissance des secteurs maritimes et des régions

côtières tout en veillant à la protection du milieu marin. La stratégie adoptée

le 3 septembre par la Commission européenne(1) propose des moyens pour mieux

intégrer la recherche marine (connaissance et protection du milieu marin) et la

recherche maritime (technologies : énergie, transport, utilisation des

ressources), l'objectif étant d'en améliorer l'efficacité par la mise en commun

des connaissances et des ressources et par la mise en place d'un partenariat

entre monde scientifique, industrie et acteurs institutionnels dont les régions

côtières. Elle constitue le volet « recherche » de la politique maritime

intégrée pour l'Union européenne de 2007.

CAPACITÉS RENFORCÉES

(continues with 5 more paragraphs)

THE OTHER

<459 of 1090 DOCUMENTS>

<SOURCE: Europolitique Environnement (Français)>

<DATE: 18 Septembre 2008>

<AUTEUR: Anne Eckstein>

<LONGUEUR: 686 mots>

<DATE-CHARGEMENT: 25 Novembre 2008>

<LANGUE: FRENCH; FRANÇAIS>

<TYPE-PUBLICATION: Journal>

</HEADER>

RECHERCHE : LA COMMISSION VEUT INTÉGRER RECHERCHE MARINE ET MARITIME

RUBRIQUE: N° 0752