HtmlGet and WebQueryServer: an interactive interface to the Telegraph Screen Scraper

HtmlGet and WebQueryServer

HtmlGet is an interactive interface to the Telegraph Screen Scraper (TeSS). It lets the user execute a series of HTTP GET or POST requests to arrive at a page, and then extract data from that page using a TeSS screen scraper definition file. Once extracted, the data can be written to a CSV file.

Starting up HtmlGet

There are two ways to start up the screen scraper application. The first is as an interactive command shell. To start the scraper application in this mode, use the runhget shell script.

The screen scraper can also be started as a network server. In this mode, commands are read from the network. After the execution of each command or command file, the scraped results, if any, are written to the network connection in CSV format. To run the scraper as a network server, use the runwqs shell script.
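Because the server writes results back as CSV, a client can hand the returned text to any CSV parser. The following is a minimal Python sketch under stated assumptions: the response text has already been read from the connection, and the function name parse_results and the sample rows are made up for illustration. How commands are framed on the wire, which port is used, and whether a header row is present are not specified here, so the sketch does not cover them.

```python
import csv
import io


def parse_results(csv_text):
    """Turn CSV text received from the server into a list of rows.

    Hypothetical helper: it only parses text that was already read from
    the network connection. Empty lines are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return [row for row in reader if row]


# Made-up sample data, only to show quoting being handled by the parser.
sample = 'ORCL,12.50\n"Example, Inc.",3.75\n'
for row in parse_results(sample):
    print(row)
```

Using a real CSV parser rather than splitting on commas matters when a scraped value itself contains a comma.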

Configuring readline

The interactive command shell can be configured to use the GNU readline library so that users can scroll back and forth through the command history. To use this feature, the code from the java-readline project on SourceForge must be installed on your system.
See http://java-readline.sourceforge.net/ for installation information.

Command Files

All of the commands specified in the following section can be included in command files to be executed later. Such command files can also contain variable references, which are substituted at the time the commands in the file are actually run. A variable reference has the form $ARG:#$, where # is the position of the argument (for example, $ARG:1$ for the first argument).
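The substitution step can be pictured with a short Python sketch. This is not the HtmlGet implementation; the function name substitute_args, the example URL, and the choice to leave a reference untouched when no matching argument was supplied are all assumptions made for illustration.

```python
import re

# Matches variable references such as $ARG:1$ or $ARG:12$.
ARG_REF = re.compile(r"\$ARG:(\d+)\$")


def substitute_args(text, args):
    """Replace each $ARG:n$ in text with the n-th value of args (1-based)."""

    def replace(match):
        index = int(match.group(1))
        if 1 <= index <= len(args):
            return args[index - 1]
        # Assumption: a reference with no matching argument is left as-is.
        return match.group(0)

    return ARG_REF.sub(replace, text)


# Hypothetical command line from a command file.
command = "GET http://example.com/quote?symbol=$ARG:1$"
print(substitute_args(command, ["ORCL"]))
# prints: GET http://example.com/quote?symbol=ORCL
```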

TeSS

The documentation for the Telegraph Screen Scraper can be found at http://telegraph.cs.berkeley.edu/tess

Example

The files in the bin/WebFiles directory, located under the TelegraphCQ installation directory, show how HtmlGet and TeSS can be used to obtain stock quotes from a web page. The file quotes.hget contains a series of HtmlGet commands. The file quotes.jsc contains the TeSS directives needed to extract data from the returned web page.

This example has been integrated with TelegraphCQ and is ready for use after the 'make tcqdemosetup' command has been run to set up the demonstration streams.

To obtain a stock quote, try the query:

Select * from web.quotes w where w.symbol='ORCL';

HTMLGet/WebQueryServer Command Summary

COMMAND	ARGS	DESCRIPTION
GET	URL	Retrieve the contents of the provided URL using the GET method of encoding the query portion of the URL. This means that query keys and values are encoded in the URL itself that is sent to the server.
POST	URL	Retrieve the contents of the provided URL using the POST method of encoding the query portion of the URL. This means that query keys and values are sent to the server separately from the URL.
FILE	Filename,arg[,...]	Execute a text file that contains other HtmlGet commands. The arguments after the first one are the actual values for the variables referenced in the file. For instance, the command file foo,arg1,arg2 will run the commands in the file foo and substitute the value arg1 for each instance of $ARG:1$ and the value arg2 for each instance of $ARG:2$.
Runtess	UseCurrent, Isnew, jscfile, [args,...]	Run TeSS. The arguments to this command indicate where the page source comes from and how the results are stored. The arguments are: UseCurrent - either the string true or the string false. If true, the current HtmlGet page is used as input to the screen scraper. If false, the host and URL in the JSC file are used to retrieve the source page. IsNew - either true or false. If false, the results extracted from this run of TeSS are added to the current set of results. Jscfile - the path to a TeSS .jsc file that is used to extract results from the page. Args, ... - TeSS wrappers may have arguments associated with them. If the value of UseCurrent is false, the values provided here are passed to TeSS as the wrapper's set of input values.
Crawlandscrape	Jscfile, query_contains, query_does_not_contain, Linktext	This command is meant to scrape results off of a series of identical result pages. The JSC file is used to run TeSS on each of the result pages and extract values which are added to the current set of scraped results. The JSCfile is a required argument. Any of the other arguments can be omitted by using the string "null" as the value for that argument. The crawlandscrape command determines which links in a page point at more result pages in one of the following ways: 1) the displayed link text matches the linktext regular expression OR 2) the URL in the link matches that of the original page AND the query parameters match both the query_contains and query_does_not_contain regular expressions if they have been specified.
Showpage		Take the current page and render it using Java's HTML display capability. Some pages may not display correctly or at all.
Hidepage		Remove the display of the page.
Dumppage		Print the text of the current page to the standard output.
Currenturl		Display the URL of the current page.
Savehistory	Filename	Write the history of visited URLs to a file.
Getselection		Provided that the current page is rendered using showpage, this command displays an HTML document which contains only the selected portion of the document.
Getforms		Process the current page and extract the forms from it.
Listforms		List the names of all forms on the current page. This command must be called after getforms.
Listformproperties	Formname	List all the properties of the given form along with the current default values.
Submitform	Formname, args	Submit a form. All arguments after the name of the form are optional and will override the default values for the corresponding form element.
Resultstodb	Table, jdbc-url,user,password	Take the current results and place them in a database.
Resultstocsv	File	Print the screenscraping results to a file or to the screen in comma separated value format.
resulttojava	classname	This command allows a callout to a Java class. The class must: 1. implement the Runnable interface 2. have a constructor which takes (java.util.Vector, HtmlGet.HtmlGet) as its arguments. The class will be passed the current TeSS screen scraping results as a vector of Object arrays, and the current instance of the HtmlGet class. With these two arguments, the Java code can access all state of the current session and programmatically alter the session by calling the methods of the HtmlGet class.
Justtags	TagRE[,...]	This command takes a list of regular expressions which match HTML start element tags. The current page is filtered so that, after this command runs, the page contains only these tags. For example: justtags <INPUT will filter the document so that it contains only HTML input tags. NOTE: the result of the filtering process may not itself be a valid HTML page.
Betweentags	StartRE,endRE,[startREendRE]	Extract the contents of a page between sets of tags. The resulting document will also be marked up with comments which describe what "level" in the original document the results came from.
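
The link-selection rule described for crawlandscrape can be sketched in Python as follows. This is an illustration of the rule as worded in the table, not the actual implementation. The function name is_result_link and the example URLs are hypothetical, and several points are assumptions: regular expressions are applied as searches rather than full matches, "the URL in the link matches that of the original page" is read as same scheme, host, and path, and the query_does_not_contain condition is read as "the query string must not match this expression". Arguments given as "null" on the command line are represented here as None.

```python
import re
from urllib.parse import urlsplit


def is_result_link(link_url, link_text, page_url,
                   query_contains=None, query_does_not_contain=None,
                   linktext=None):
    """Decide whether a link points at another result page."""
    # Rule 1: the displayed link text matches the linktext expression.
    if linktext is not None and re.search(linktext, link_text):
        return True

    # Rule 2: the link targets the same page as the original URL
    # (assumption: same scheme, host, and path, ignoring the query) ...
    link, page = urlsplit(link_url), urlsplit(page_url)
    same_page = (link.scheme, link.netloc, link.path) == \
                (page.scheme, page.netloc, page.path)
    if not same_page:
        return False

    # ... and the query string satisfies whichever filters were specified.
    if query_contains is not None and not re.search(query_contains, link.query):
        return False
    if query_does_not_contain is not None and re.search(query_does_not_contain, link.query):
        return False
    return True


page = "http://example.com/search?q=db&page=1"
print(is_result_link("http://example.com/search?q=db&page=2", "2", page,
                     query_contains=r"page=\d+"))
```

Checking the link text first lets a simple "Next" style link be followed even when its URL differs from that of the original page.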