TeSS and HtmlGet

TeSS is the Telegraph Screen Scraper. It uses user-specified regular expressions to extract data from web pages into structured data tuples.
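The general idea of turning regular-expression matches into tuples can be sketched in plain Java. This is only an illustration built on java.util.regex: it is not TeSS code and does not use the .jsc file format, and the class name, the HTML fragment, and the pattern are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not TeSS code and not the .jsc format.
class RegexTupleSketch {

    // Apply one regular expression to the page. Each match becomes one tuple
    // whose fields are the pattern's capture groups, in order.
    static List<String[]> extract(String page, Pattern rowPattern) {
        List<String[]> tuples = new ArrayList<>();
        Matcher m = rowPattern.matcher(page);
        while (m.find()) {
            String[] tuple = new String[m.groupCount()];
            for (int i = 0; i < tuple.length; i++) {
                tuple[i] = m.group(i + 1); // group 0 is the whole match
            }
            tuples.add(tuple);
        }
        return tuples;
    }

    public static void main(String[] args) {
        // Made-up page fragment with two table rows.
        String page = "<tr><td>ABC</td><td>12.50</td></tr>"
                    + "<tr><td>XYZ</td><td>7.25</td></tr>";
        Pattern row = Pattern.compile("<tr><td>(\\w+)</td><td>([0-9.]+)</td></tr>");
        for (String[] tuple : extract(page, row)) {
            System.out.println(String.join(",", tuple));
        }
    }
}
```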
 

HtmlGet is an interactive interface to the Telegraph Screen Scraper (TeSS). HtmlGet maintains a copy of the current web page and a current set of result items extracted via TeSS. HtmlGet commands let the user set the current page with HTTP GET or POST requests, and then extract data from the page using a TeSS screen scraper definition file. Once extracted, this data can be written to a CSV file, stored in a PostgreSQL database, or processed further using user-defined Java routines.
 
Starting up HtmlGet
 
The shell script runhtmlget.sh starts the HtmlGet command shell.
 
 
Using Readline in the shell
 
The interactive command shell can be configured to use the GNU readline library so that users can scroll back and forth through the command history. To use this feature, the code from the java-readline project on SourceForge must be installed on your system.
See http://java-readline.sourceforge.net/ for installation information.
 
Command Files
 
All of the commands described in the following section can be placed in command files to be executed later. Such command files can also contain variable references, which are substituted when the commands in the file are actually run. A variable reference has the form $ARG:#$, where # is the position of the argument to substitute (for example, $ARG:1$ for the first argument).
 
Example
 
The files in the bin/WebFiles directory located under the TelegraphCQ installation directory show how HtmlGet and TeSS can be used to obtain stock quotes from a web page. The file quotes.hget contains a series of HtmlGet commands. The file quotes.jsc contains the TeSS directives necessary to extract data from the returned web page.
 

To run the example:

Another file in the WebFiles directory, quotes2.hget, gets the quote using a sequence of HtmlGet commands that retrieve the form, extract it, and submit it.


 
Acknowledgements

TeSS and HtmlGet now use modified W3C Jigsaw client libraries to contact remote servers. This allows TeSS to support HTTP cookies.




HtmlGet Command Summary

 

COMMAND

ARGS

DESCRIPTION

GET

URL

Retrieve the contents of the provided URL using the GET method of encoding the query portion of the URL. This means that query keys and values will be encoded in the URL itself that is sent to the server.

POST

URL

Retrieve the contents of the provided URL using the POST method of encoding the query portion of the URL. This means that query keys and values will be sent to the server separately from the URL, in the body of the request.
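The difference between the two encodings can be sketched without touching the network. The following is a minimal illustration using only java.net.URLEncoder; it is not how HtmlGet builds its requests internally, and the class name, URL, and parameter names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of GET versus POST query encoding.
class QueryEncodingSketch {

    // Encode key/value pairs in application/x-www-form-urlencoded form.
    static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    // GET: the encoded query becomes part of the URL itself.
    static String getUrl(String baseUrl, Map<String, String> params) {
        return baseUrl + "?" + encode(params);
    }

    // POST: the URL stays bare and the encoded query is the request body.
    static String postBody(Map<String, String> params) {
        return encode(params);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("symbol", "ABC");
        params.put("q", "a b");
        System.out.println("GET  url : " + getUrl("http://example.com/quote", params));
        System.out.println("POST url : http://example.com/quote");
        System.out.println("POST body: " + postBody(params));
    }
}
```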

FILE

Filename,arg[,...]

This command executes a text file that contains other HTMLGet commands. The arguments after the first one represent the actual values for the variables which are referenced in the file.

 

For instance, the command file foo,arg1,arg2 will run the commands in the file foo, substituting the value arg1 for each instance of $ARG:1$ and the value arg2 for each instance of $ARG:2$.
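The substitution described above can be sketched as follows. This is an assumption-laden illustration rather than the HtmlGet implementation: the class name is hypothetical, and leaving a reference untouched when no matching argument was supplied is a choice made here, not documented behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of $ARG:n$ substitution in a command-file line.
class ArgSubstitutionSketch {

    // Matches references such as $ARG:1$ and captures the argument number.
    private static final Pattern ARG_REF = Pattern.compile("\\$ARG:(\\d+)\\$");

    static String substitute(String line, String... args) {
        Matcher m = ARG_REF.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1)); // 1-based position
            // Assumption: a reference with no matching argument is left as-is.
            String value = (index >= 1 && index <= args.length)
                    ? args[index - 1]
                    : m.group();
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The URL below is made up for the example.
        String line = "get http://example.com/quote?s=$ARG:1$";
        System.out.println(substitute(line, "ABC"));
    }
}
```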

Runtess

UseCurrent, Isnew, jscfile, [args,...]

Run TeSS. The arguments to this command indicate exactly where the page source comes from, and how the results are stored.

 

The arguments are:

UseCurrent - the value of this argument is either the string true or the string false. If the value is true, the current HtmlGet page will be used as input to the screen scraper. If the value is false, then the host and URL in the JSC file will be used to retrieve the source page.

IsNew - this argument is either true or false. If false, the results extracted from this run of TeSS will be added to the current set of results.

Jscfile - the path to a TeSS .jsc file which will be used to extract results from the page.

Args, ... - TeSS wrappers may have arguments associated with them. If the value of UseCurrent is false, then the values provided here are passed to TeSS as the wrapper's set of input values.

Crawlandscrape

Jscfile, query_contains, query_does_not_contain, Linktext

This command is meant to scrape results from a series of result pages that share the same layout. The JSC file is used to run TeSS on each of the result pages and extract values, which are added to the current set of scraped results.

 

The JSCfile is a required argument. Any of the other arguments can be omitted by using the string "null" as the value for that argument.

 

The crawlandscrape command determines which links in a page point at more result pages in one of the following ways:

1)  the displayed link text matches the linktext regular expression

OR

2) the URL in the link matches that of the original page AND the query parameters match both the query_contains and query_does_not_contain regular expressions if they have been specified.
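The two rules above can be read as a single predicate over a link. The sketch below is one interpretation, not the crawlandscrape source: the class and method names are hypothetical, an omitted argument is represented by a Java null, "matches" is taken to mean that the regular expression is found somewhere in the text, and rule 2 is taken to mean that the query must contain query_contains and must not contain query_does_not_contain.

```java
import java.net.URI;
import java.util.Objects;
import java.util.regex.Pattern;

// Hypothetical illustration of how a link could be classified as a result page.
class ResultLinkSketch {

    // Compare two URLs while ignoring their query strings.
    static boolean sameLocation(URI a, URI b) {
        return Objects.equals(a.getScheme(), b.getScheme())
            && Objects.equals(a.getHost(), b.getHost())
            && a.getPort() == b.getPort()
            && Objects.equals(a.getPath(), b.getPath());
    }

    // A null pattern stands for an argument that was omitted.
    static boolean isResultLink(URI pageUrl, URI linkUrl, String linkText,
                                Pattern linkTextRe,
                                Pattern queryContains,
                                Pattern queryDoesNotContain) {
        // Rule 1: the displayed link text matches the linktext expression.
        if (linkTextRe != null && linkText != null
                && linkTextRe.matcher(linkText).find()) {
            return true;
        }
        // Rule 2: same URL as the original page, and the query passes both filters.
        if (!sameLocation(pageUrl, linkUrl)) {
            return false;
        }
        String query = linkUrl.getRawQuery() == null ? "" : linkUrl.getRawQuery();
        if (queryContains != null && !queryContains.matcher(query).find()) {
            return false;
        }
        if (queryDoesNotContain != null && queryDoesNotContain.matcher(query).find()) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up URLs for the example.
        URI page = URI.create("http://example.com/search?q=tess&page=1");
        URI next = URI.create("http://example.com/search?q=tess&page=2");
        System.out.println(isResultLink(page, next, "2", null,
                Pattern.compile("page="), Pattern.compile("sort=")));
    }
}
```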

Showpage

 

Render the current page using Java's HTML display capability. Some pages may not display correctly, or at all.

Hidepage

 

Remove the display of the page.

Dumppage

 

Print the text of the current page to the standard output.

Currenturl

 

Display the URL of the current page.

Savehistory

Filename

Write the history of visited URLs to a file.

Getselection

 

Provided that the current page is rendered using showpage, this command will display an HTML document which contains only the selected portion of the document.

Getforms

 

Process the current page and extract the forms from it.

Listforms

 

List the names of all forms on the current page. This command must be called after getforms.

Listformproperties

Formname

List all the properties of the given form along with the current default values.

Submitform

Formname, args

Submit a form. All arguments after the name of the form are optional and will override the default values of the corresponding form elements.

Resultstodb

Table, jdbc-url,user,password

Take the current results and place them in a database.

Resultstocsv

File

Print the screen scraping results to a file or to the screen in comma-separated value (CSV) format.
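As a rough picture of what comma-separated output of the results involves, the sketch below serialises a vector of Object arrays, which is how the results are described for resulttojava. The quoting rules follow the common CSV convention of doubling embedded quotes; whether resultstocsv quotes fields in the same way is an assumption, and the class name is hypothetical.

```java
import java.util.Vector;

// Hypothetical illustration of writing result rows as CSV text.
class CsvSketch {

    // Quote a field only when it contains a comma, a quote, or a line break.
    static String quote(Object field) {
        String s = field == null ? "" : field.toString();
        if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    // One output line per result row, fields separated by commas.
    static String toCsv(Vector<Object[]> results) {
        StringBuilder sb = new StringBuilder();
        for (Object[] row : results) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(quote(row[i]));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Vector<Object[]> results = new Vector<>();
        results.add(new Object[] {"ABC", "12.50"});
        results.add(new Object[] {"X, Inc", "7.25"});
        System.out.print(toCsv(results));
    }
}
```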

resulttojava

classname

This command allows a callout to a Java class. The class must:

1. implement the Runnable interface
2. have a constructor which takes (java.util.Vector, HtmlGet.HtmlGet) as its arguments.

The class will be passed the current TeSS screen scraping results as a vector of Object arrays, and the current instance of the HtmlGet class.

With these two arguments, the Java code can access all state of the current session and programmatically alter the session by calling the methods of the HtmlGet class.
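A callout class meeting the two requirements might look like the sketch below. So that the example compiles on its own, it uses a stand-in type named HtmlGetStandIn, which is hypothetical: in a real callout the stand-in would be removed and the second constructor parameter declared as HtmlGet.HtmlGet. The class name and what run() does are also made up for the example, and a real callout would probably need to be a public class in its own file.

```java
import java.util.Arrays;
import java.util.Vector;

// Stand-in so this sketch compiles without the TeSS libraries.
// A real callout would use HtmlGet.HtmlGet instead of this type.
class HtmlGetStandIn { }

// Hypothetical callout: prints each result row and counts the rows.
class PrintResultsCallout implements Runnable {

    private final Vector<Object[]> results;
    private final HtmlGetStandIn session;
    int rowCount = -1; // set by run()

    // Constructor shape required by resulttojava: (results, session).
    public PrintResultsCallout(Vector<Object[]> results, HtmlGetStandIn session) {
        this.results = results;
        this.session = session; // unused here; a real callout could call its methods
    }

    @Override
    public void run() {
        rowCount = 0;
        for (Object[] row : results) {
            System.out.println(Arrays.toString(row));
            rowCount++;
        }
    }

    public static void main(String[] args) {
        Vector<Object[]> results = new Vector<>();
        results.add(new Object[] {"ABC", "12.50"});
        new PrintResultsCallout(results, new HtmlGetStandIn()).run();
    }
}
```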

Justtags

TagRE[,...]

This command takes a list of regular expressions which match HTML start element tags. The current page will be filtered such that after this command runs, the page will only contain these tags.

 

For example: justtags <INPUT

 

Will filter the document such that it contains only HTML input tags.

 

NOTE: the result of the filtering process may not leave a result that is itself a valid HTML page.
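The filtering behaviour can be approximated with a few lines of Java. This is a simplified sketch, not the justtags implementation: the class name is hypothetical, tags are located with a naive pattern that does not handle a ">" inside an attribute value, and matching the expressions case-insensitively and writing one tag per line are assumptions made here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of keeping only the tags that match given expressions.
class JustTagsSketch {

    // Naive tag finder: anything from '<' up to the next '>'.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String justTags(String page, String... tagRes) {
        Pattern[] starts = new Pattern[tagRes.length];
        for (int i = 0; i < tagRes.length; i++) {
            // Assumption: tag names are matched case-insensitively.
            starts[i] = Pattern.compile(tagRes[i], Pattern.CASE_INSENSITIVE);
        }
        StringBuilder out = new StringBuilder();
        Matcher m = ANY_TAG.matcher(page);
        while (m.find()) {
            String tag = m.group();
            for (Pattern p : starts) {
                // lookingAt anchors the expression at the start of the tag.
                if (p.matcher(tag).lookingAt()) {
                    out.append(tag).append('\n');
                    break;
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<form action=\"/q\"><INPUT name=\"s\"><b>x</b>"
                    + "<input type=\"submit\"></form>";
        System.out.print(justTags(page, "<INPUT"));
    }
}
```

As the note above says, the output of such a filter is a list of tags rather than a well-formed HTML page.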

Betweentags     

StartRE,endRE,[startRE,endRE]

Extract the contents of a page between sets of tags. The resulting document will also be marked up with comments which describe what "level" in the original document the results came from.