Download Parser

First Parse

The point of it

The first parse takes the downloaded texts and splits each one into its component stories, as well as parsing each story's structure and putting the relevant fields into headers.

How to do it

You will process your data folder by folder. Each input folder may contain sub-folders if you like; the program will process the files in them automatically. Data in different languages must be in different folders.
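As a rough illustration of processing a folder together with its sub-folders, here is a minimal Python sketch. It is not the program's own code; the function name `list_download_files` and the `*.txt` pattern are assumptions made for the example.

```python
from pathlib import Path

def list_download_files(input_folder, pattern="*.txt"):
    """Return every matching file in input_folder and all its sub-folders."""
    root = Path(input_folder)
    # rglob descends into sub-folders at any depth
    return sorted(p for p in root.rglob(pattern) if p.is_file())
```

Calling this once per language folder mirrors the rule that data in different languages are kept apart.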

The First Parse process is carried out in one of two ways, depending on how you collected your data:

a) different search-term downloads in different folders

If you ran several downloads, each looking for a different search-term, then run the First Parse process several times, once for each search-term. Use one folder for each search-term's data. For example, there will be one folder of 'climate change' data and separate folders for 'greenhouse effect', 'global warming', etc.

English

Search-word A folder and any sub-folders

Search-word B folder and sub-folders

etc.

Chinese

Search-word F folder and sub-folders

Search-word G folder and sub-folders

etc.

b) one or more search-term downloads all in the same folder

In this case you can run the First Parse process once for each language.

English

data in folder X and any sub-folders

Chinese

data in folder Y and any sub-folders

For each run, choose

• Downloaded files folder

• Language

• Search-word

and press "Load download set". This shows which downloaded files have been found. The program checks whether exactly the same contents occur in two files (if so, it offers to delete one), then lists all the download files. Press "Process them".
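The duplicate check can be pictured with a short Python sketch. This is only one plausible way to find files with identical contents (comparing SHA-256 digests of the raw bytes); how the program itself performs the check is not stated here, and the function name `find_duplicate_files` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicate_files(paths):
    """Group files whose bytes are identical; return only groups of two or more."""
    by_digest = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        by_digest[digest].append(Path(path))
    return [group for group in by_digest.values() if len(group) > 1]
```

Each returned group holds files that could be reduced to a single copy; the sketch only reports them and deletes nothing.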

Each original article is parsed for signs of its date, author and other details. This information a) determines the folder structure and file-names of the results, and b) is written as header information in < > brackets at the top of each text file.
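To make the two uses of the parsed fields concrete, here is a small Python sketch. The header layout (`<name value>`, one per line), the field names, and the language/year/month folder scheme are all assumptions chosen for illustration; the actual layout produced by the program may differ.

```python
from datetime import date
from pathlib import Path

def build_header(fields):
    """Render each field as one <name value> line (hypothetical layout)."""
    return "\n".join(f"<{name} {value}>" for name, value in fields.items())

def output_path(root, language, published, story_number):
    """Build a path under language/year/month, named by date and sequence number."""
    name = f"{published.isoformat()}_{story_number:04d}.txt"
    return Path(root) / language / f"{published.year}" / f"{published.month:02d}" / name
```

For example, `build_header({"date": "2012-03-05", "author": "A. Writer"})` gives two bracketed lines that would sit above the story text.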

The Windows file-date of each output file is also set to the publication date of its contents, provided that date is 1980 or later (earlier dates cannot be handled).
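Setting a file's date to match its publication date can be sketched in Python as follows. This is an illustration under assumptions, not the program's implementation: the function name `set_file_date` is hypothetical, and the sketch sets both the access and modification times using the standard `os.utime` call.

```python
import os
from datetime import datetime

EARLIEST_YEAR = 1980  # cut-off taken from the text above

def set_file_date(path, published):
    """Set the file's access and modification times to the publication date.

    Returns True if the date was applied, False if it was too early and skipped.
    """
    if published.year < EARLIEST_YEAR:
        return False
    stamp = published.timestamp()  # naive datetimes are read as local time
    os.utime(path, (stamp, stamp))
    return True
```

Sorting the output folder by date in a file manager then orders the stories by when they were published.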

Your original download text files will get converted to Unicode if not in Unicode already. (If they came from a Mac it is best to use the WordSmith Text Converter on them first. Mac computers usually add some unwanted files and folders.)
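A conversion of this kind might look like the Python sketch below. It is a guess at the general idea only: the detection order (byte-order mark, then UTF-8, then a fallback code page) and the `cp1252` default are assumptions, and the function name `to_unicode_text` is hypothetical.

```python
import codecs

def to_unicode_text(raw, fallback="cp1252"):
    """Decode raw bytes using a BOM if present, then UTF-8, then a fallback code page."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes undefined in the fallback code page become U+FFFD
        return raw.decode(fallback, errors="replace")
```

The decoded string can then be written back out in whichever Unicode encoding is wanted, for example with `open(path, "w", encoding="utf-16")`.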

When determining the headers for each article, the parsing process uses the list of document fields given in the search-settings.

If this process works satisfactorily, the list of downloaded files will be cleared ready for the next folder to be parsed.

Text by Mike Scott, Help system by Help&Manual