HtmlGet is an interactive interface to the Telegraph Screen Scraper (TeSS). HtmlGet allows the user to execute a series of HTTP GET or POST requests to arrive at a page, and then to extract data from the page using a TeSS screen scraper definition file. Once extracted, this data can be output to a CSV file.
There are two ways to start up the screen scraper application. The first is as an interactive command shell. To start the scraper application in this mode, use the runhget shell script.
The screen scraper can also be started as a network server. In this mode, commands are read from the network. After the execution of each command or command file, the scraped results, if any, are written to the network connection in CSV format. To run the scraper as a network server, use the runwqs shell script.
The interactive command shell can be configured to use the GNU readline library so that users can scroll back and forth through the command history. To use this feature, the code from the java-readline project on SourceForge must be installed on your system. See http://java-readline.sourceforge.net/ for installation information.
All of the commands specified in the following section can be included in command files to be executed later. Such command files can also contain variable references, which are substituted at the time the commands in the file are actually run. A variable reference has the form $ARG:#$, where # is the position of the argument to substitute (for example, $ARG:1$ for the first argument).
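The substitution of variable references can be illustrated with a short Java sketch. This is not the HtmlGet implementation: the class and method names are hypothetical, the example URL is made up, and the choice to leave a reference untouched when no matching argument was supplied is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of $ARG:n$ substitution in a command-file line. */
public class ArgSubstitutionSketch {

    // Matches references such as $ARG:1$ or $ARG:12$.
    private static final Pattern ARG_REF = Pattern.compile("\\$ARG:(\\d{1,9})\\$");

    /** Replaces each $ARG:n$ in the line with the n-th argument (1-based). */
    public static String substitute(String line, String... args) {
        Matcher m = ARG_REF.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1));
            // Assumption: a reference without a matching argument is left as-is.
            String value = (index >= 1 && index <= args.length)
                    ? args[index - 1]
                    : m.group();
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] unused) {
        // The URL below is a made-up example.
        System.out.println(
                substitute("get http://example.com/quote?symbol=$ARG:1$", "ORCL"));
    }
}
```

Here the first value supplied after the file name replaces every occurrence of $ARG:1$, the second replaces $ARG:2$, and so on.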
The documentation for the Telegraph Screen Scraper can be found at http://telegraph.cs.berkeley.edu/tess
The files in the bin/WebFiles directory located under the TelegraphCQ installation directory show how HtmlGet and TeSS can be used to obtain stock quotes from a web page. The file quotes.hget contains a series of HtmlGet commands. The file quotes.jsc contains the TeSS directives necessary to extract data from the returned web page.
This example has been integrated with TelegraphCQ and is ready for use after the 'make tcqdemosetup' command has been run to set up the demonstration streams.
To obtain a stock quote, try the query:
Select * from web.quotes w where w.symbol='ORCL';
COMMAND |
ARGS |
DESCRIPTION |
GET |
URL |
Retrieve the contents of the provided URL using the GET method of encoding the query portion of the URL. This means that query keys and values will be encoded in the URL itself that is sent to the server. |
POST |
URL |
Retrieve the contents of the provided URL using the POST method of encoding the query portion of the URL. This means that query keys and values will be sent to the server separately from the URL. |
FILE |
Filename,arg[,...] |
This command executes a text file that contains other HtmlGet commands. The arguments after the first one supply the values for the variables referenced in the file.
For instance, the command file foo,arg1,arg2
will run the commands in the file foo, substituting the value arg1 for each instance of $ARG:1$ and the value arg2 for each instance of $ARG:2$. |
Runtess |
UseCurrent, Isnew, jscfile, [args,...] |
Run TeSS. The arguments to this command indicate where the page source comes from and how the results are stored.
The arguments are:
UseCurrent - either the string true or the string false. If true, the current HtmlGet page is used as input to the screen scraper. If false, the host and URL in the JSC file are used to retrieve the source page.
IsNew - either true or false. If false, the results extracted from this run of TeSS are added to the current set of results.
Jscfile - the path to a TeSS .jsc file that is used to extract results from the page.
Args, ... - TeSS wrappers may have arguments associated with them. If the value of UseCurrent is false, the values provided here are passed to TeSS as the wrapper's set of input values.
|
Crawlandscrape |
Jscfile, query_contains, query_does_not_contain, Linktext |
This command is meant to scrape results off of a series of identical result pages. The JSC file is used to run TeSS on each of the result pages and extract values which are added to the current set of scraped results.
The JSCfile is a required argument. Any of the other arguments can be omitted by using the string "null" as the value for that argument.
The crawlandscrape command determines which links in a page point at more result pages in one of the following ways: 1) the displayed link text matches the linktext regular expression OR 2) the URL in the link matches that of the original page AND the query parameters match both the query_contains and query_does_not_contain regular expressions if they have been specified. |
Showpage |
|
Take the current page and render it using Java's HTML display capability. Some pages may not display correctly, or at all. |
Hidepage |
|
Remove the display of the page |
Dumppage |
|
Print the text of the current page to the standard output |
Currenturl |
|
Display the URL of the current page. |
Savehistory |
Filename |
Write the history of visited URLs to a file.
Getselection |
|
Provided that the current page is rendered using showpage, this command will display an HTML document which contains only the selected portion of the document. |
Getforms |
|
Process the current page and extract the forms from it. |
Listforms |
|
List the names of all forms on the current page. This command must be called after getforms. |
Listformproperties |
Formname |
List all the properties of the given form along with the current default values. |
Submitform |
Formname, args |
Submit a form. All arguments after the name of the form are optional and will override the default values for the corresponding form element. |
Resultstodb |
Table, jdbc-url,user,password |
Take the current results, and place them in a database |
Resultstocsv |
File |
Print the screen-scraping results to a file or to the screen in comma-separated value (CSV) format.
|
resulttojava |
classname |
This command allows a callout to a Java class. The class must: 1. implement the Runnable interface; 2. have a constructor which takes (java.util.Vector, HtmlGet.HtmlGet) as its arguments. The class will be passed the current TeSS screen-scraping results as a vector of Object arrays, and the current instance of the HtmlGet class. With these two arguments, the Java code can access all state of the current session and programmatically alter the session by calling the methods of the HtmlGet class. |
Justtags |
TagRE[,...] |
This command takes a list of regular expressions which match HTML start element tags. The current page will be filtered such that after this command runs, the page will only contain these tags.
For example: justtags <INPUT
will filter the document so that it contains only HTML input tags.
NOTE: the filtering process may leave a result that is not itself a valid HTML page. |
Betweentags |
StartRE,endRE,[startREendRE] |
Extract the contents of a page between sets of tags. The resulting document will also be marked up with comments which describe what "level" in the original document the results came from.
|
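The link-selection rule described for the Crawlandscrape command can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the HtmlGet implementation: the class and method names are hypothetical, a regular expression is assumed to "match" when it is found anywhere in the text, "the URL in the link matches that of the original page" is assumed to mean the same host and path, and query_does_not_contain is assumed to exclude links whose query string matches it.

```java
import java.net.URI;
import java.util.Objects;
import java.util.regex.Pattern;

/** Hypothetical illustration of how crawlandscrape might pick result-page links. */
public class LinkSelectionSketch {

    /** An argument given as the string "null" (or missing) counts as omitted. */
    public static boolean isSpecified(String regex) {
        return regex != null && !regex.equals("null");
    }

    /** Assumption: "matches" means the pattern occurs somewhere in the text. */
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static boolean isResultLink(URI originalPage, URI link, String linkLabel,
                                       String queryContains,
                                       String queryDoesNotContain,
                                       String linkText) {
        // Rule 1: the displayed link text matches the linktext expression.
        if (isSpecified(linkText) && linkLabel != null && found(linkText, linkLabel)) {
            return true;
        }
        // Rule 2: the link points at the same page as the original ...
        boolean samePage = Objects.equals(originalPage.getHost(), link.getHost())
                && Objects.equals(originalPage.getPath(), link.getPath());
        if (!samePage) {
            return false;
        }
        // ... and its query string passes both filters, where specified.
        String query = link.getQuery() == null ? "" : link.getQuery();
        if (isSpecified(queryContains) && !found(queryContains, query)) {
            return false;
        }
        if (isSpecified(queryDoesNotContain) && found(queryDoesNotContain, query)) {
            return false;
        }
        return true;
    }

    public static void main(String[] unused) {
        // Made-up URLs for illustration only.
        URI page = URI.create("http://example.com/search?q=orcl&page=1");
        URI next = URI.create("http://example.com/search?q=orcl&page=2");
        System.out.println(isResultLink(page, next, "2", "page=\\d+", "print=1", "null"));
    }
}
```

Under these assumptions, a link is followed either because its visible text matches Linktext, or because it leads back to the same page with a query string that passes the two query filters.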