In the following example we will wrap the results of a simple Google(C) search.
<form action="/search" method=get name=f> <table cellspacing=0 cellpadding=0 ><tr align=center valign=baseline> <td width=75> </td><td nowrap><font face=arial,sans-serif size=-1> Search 1,326,920,000 web pages</font> </td><td></td></tr> <tr align=center valign=middle><td width=75> </td><td align=center> <input type=text value="" framewidth=4 name=q size=55><br> <input name=btnG type=submit value="Google Search"> <input name=btnI type=submit value="I'm Feeling Lucky"> </td><td nowrap valign=top align=left><font face=arial,sans-serif size=-2> <a href="/advanced_search">Advanced Search</a><br> <a href="/preferences">Preferences</a></font></td></tr></table></form> |
![]() |
The first thing to notice is the action tag. This tells you what page the form is going to contact when you submit the query.
In this case the action tag points to /search
So in our wrapper file we would add host=www.google.com (since there is no http:// at the start of the action the browser will use the current page as the base and expand the action into http://www.google.com/search) and url=/search?.
host=www.google.com url=/search?
Now we need to figure out what method this form uses to send the data to the server. The two possibilities in a form are get and post. You can look at the method tag to figure out which one to use. This form happens to use get. See the documentation on query types for more infomation. So now our wrapper looks like:
host=www.google.com url=/search? method=get
Now we need to figure out what parameters the query is expecting. To do this find all the input tags in the form and look at thier values. Lucky for this form has only one input we need to worry about. Inputs of type submit and reset you don't need to worry about since they just tell the browser to render a button that submits the query or resets the form and they are not parameters to the query. See the documentation about different input tags for more information. In our query we only need to concern ourselves with the line:
<input type=text value="" framewidth=4 name=q size=55><br>The important fields here are the value and name fields. The value is the default value for the input, the name is the name of the parameter.
So what we have is a form with one parameter named q that is a text string that defaults to the empty string. To put this the wrapper we would add:
arguments=q, default1=Note that the default does not have to be the same as the form default. If you wanted to default to searching for Blue Meanies you could write
arguments=q, default1=Blue Meaniesand you would have no problems. If we had more arguments (say r and s) we would write (Assuming we wanted the given defaults):
arguments=q,r,s, default1=Blue Meanies default2=Moog Cookbook default3=NOFXand q would default to Blue Meanies, s to Moog Cookbook and so on. So now our wrapper looks like:
host=www.google.com url=/search? method=get arguments=q, default1=
Now we need to tell TESS what results to expect. This is done with the columns tag. For this example we will get the name of the site, the description, and the url. The ordering is important here, the columns need to be listed in the order they appear in the data set. To get this information you'll probably need to run a query.
So now we have:
host=www.google.com url=/search? method=get arguments=q, default1= columns=name,description,url,
At this point you have fully specified everything TESS needs to know to find the data. Now it's on to telling it how to put that data into tabular form. You'll need to run a sample query on the webpage now. Sometimes different queries can return results differently (for example Mirriam Websters dictionary search will return one type of result if you search for a word it knows about and fairly different kind if you search of one it doesn't know about) so you might need to run a number of sample queries
To run a sample query we will add the dump directive in our wrapper. This tells TESS to dump the html it gets back from running the query and not to try and return any real data. This can be usful both for row/column regular expresion writing but also for debuging the argument part of your script.
For our example lets run a query on "Telegraph". We will change our wrapper to look like:
host=www.google.com url=/search? method=get arguments=q, default1=telegraph columns=name,description,url, dump end
Looking at the top of the html that gets returned we see:
<html> <head><META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1"> <title>Google Search: telegraph </title> <style><!-- body {font-family: arial,sans-serif;} div.nav {margin-top: 1ex;} div.nav A,span.nav {font-size: 10pt; font-family: arial,sans-serif;} div.nav A,span.big {font-size: 12pt; color: #0000cc;} div.nav A {font-size: 10pt; color: black;} A.l:link {color: #6f6f6f;} //--></style> </head> <body bgcolor=#ffffff text=#000000 link=#0000cc vlink=#551A8B alink=#ff0000> <form method=GET action=/search><table border=0 cellpadding=2 cellspacing=0><tr><td rowspan=2><a href=/><img border=0 src=/intl/en_extra/images/Title_Left.gif width=200 height=78 alt="Google "></a></td><td nowrap><font face=arial,sans-serif size=-1> <A HREF=/advanced_search?q=telegraph&lr=&safe=off>Advanced Search</A> <A HREF=/preferences?q=telegraph&lr=&safe=off>Preferences</A> <A HREF=/intl/en_extra/help.html>Search Tips</A></font></td></tr><tr><td align=left valign=middle><INPUT type=text name=q size=31 maxlength=256 value="telegraph"> <INPUT TYPE=hidden name=lr value=><INPUT TYPE=hidden name=safe value=off><INPUT type=submit name=btnG VALUE="Google Search"> <INPUT type=submit name=btnI VALUE="I'm Feeling Lucky"><br></form> </td></tr></table> <table width=100% cellpadding=2 cellspacing=0 border=0><tr><td bgcolor=#3366cc align=left nowrap><font face=arial,sans-serif size=-1 color=white>Searched the web for <b><a href=/url?sa=X&q=http://www.dictionary.com/cgi-bin/dict.pl?term=telegraph title="Look up telegraph on dictionary.com"><font color=white>telegraph</font></a></b>. </td><td bgcolor=#3366cc align=right nowrap><font face=arial,sans-serif size=-1 color=white>Results <b>1 - 10</b> of about <b>745,000</b>. Search took <b>0.07</b> seconds.</font></td></tr></table><br><table border=0 cellpadding=1 cellspacing=0 width=100%><tr><td nowrap><font color=#6f6f6f size=-1 face=arial,sans-serif>Categories:</font> <font size=-1 face=arial,sans-serif> <A href=http://directory.google.com/Top/Recreation/Antiques/Telephones_and_Telegraphs/Telegraphs/> Recreation > Antiques > Telephones and Telegraphs > Telegraphs</A> <A href=http://directory.google.com/Top/Recreation/Amateur_Radio/Boatanchors/Morse_Telegraphy/> Recreation > Amateur Radio > Boatanchors > Morse Telegraphy</A> </font></td></tr></table> <p><A HREF=http://www.telegraph.co.uk/>Electronic <b>Telegraph</b> | Front Page</A><font size=-1><br> <b>...</b> the first train operating company to lose its franchise since privatisation in the<br> mid-1990s. Yesterday in Parliament Daily <b>Telegraph</b>: Connex shunted off. <b>...</b> <br><font size=-1 color=#6f6f6f>Description:</font> Britain's biggest-selling broadsheet daily. <br><font size=-1 color=#6f6f6f>Category:</font> <a href=http://directory.google.com/Top/News/Newspapers/Regional/United_Kingdom/ class=l>News > Newspapers > Regional > United Kingdom</a><br><font color=green>www.telegraph.co.uk/ - 32k - <A HREF=/search?q=cache:www.telegraph.co.uk/+telegraph class=l>Cached</A> - <A HREF=/search?lr=&safe=off&num=10&q=related:www.telegraph.co.uk/ class=l>Similar pages</A></font></font><br> <BLOCKQUOTE> <p><A HREF=http://www.telegraph.co.uk/home.html>Electronic <b>Telegraph</b> | Front Page</A><font size=-1><br> <b>...</b> Feedback poll, See text menu at bottom of page Turner Prize 2000<br> Giorgio Armani <b>Telegraph</b> Money <b>...</b> Search Electronic <b>Telegraph</b> for. <b>...</b> <br><font color=green>www.telegraph.co.uk/home.html - 25k - <A HREF=/search?q=cache:www.telegraph.co.uk/home.html+telegraph class=l>Cached</A> - <A HREF=/search?lr=&safe=off&num=10&q=related:www.telegraph.co.uk/home.html class=l>Similar pages</A></font> <br>[ <A HREF=/search?lr=&safe=off&q=site:www.telegraph.co.uk+telegraph class=l>More results from www.telegraph.co.uk</A> ] </font><br> </BLOCKQUOTE>
![]() |
Ugg, kinda messy. This is what we need to parse through and extract meaningful data out of. Note that I've only displayed the HTML for the first two results here but we exploit the fact that most web-based searches return each result in a very similar fashion. It is often useful to use both the HTML and the actual display text. Sometimes certain words in the displayed results will jump out as obvious delimiters, other times certain html tags will be the clear delimiters. The first thing to look for is some prefix text. Something that signifies that meaningful data is about to follow. The word seconds (circled in red in the picture) seems to appear at the top of the data and will probably appear in all queries so that's what we'll use prefix=seconds Now we need to break the data into rows. To do this we need to write regular expresions that match only what we want to act as the beginings and ends of rows. In this case we notice by looking at the html that each new result is prefixed by a paragraph tag at the beginning of a line like <p> (In bold in the above text). We want to only match <p>s that are at the start of lines so we will use the ^ character in our regular expresion. So we have: rowprefix=^<p> It also appears that all results end with the text "Similar pages" (circled in red) so we can use that as our rowterm. rowterm=Similar pages |
host=www.google.com url=/search? method=get arguments=q, default1= columns=name,description,url, prefix=seconds rowprefix=^<p> rowterm=Similar pages
So now we have the google data broken down into rows. All that's left to do is to extract the columns from those rows. To do this let's look at a single row:
<p><A HREF=http://www.telegraph.co.uk/>Electronic <b>Telegraph</b> | Front Page</A><font size=-1><br> <b>...</b> the first train operating company to lose its franchise since privatisation in the<br> mid-1990s. Yesterday in Parliament Daily <b>Telegraph</b>: Connex shunted off. <b>...</b> <br><font size=-1 color=#6f6f6f>Description:</font> Britain's biggest-selling broadsheet daily. <br><font size=-1 color=#6f6f6f>Category:</font> <a href=http://directory.google.com/Top/News/Newspapers/Regional/United_Kingdom/ class=l>News > Newspapers > Regional > United Kingdom</a><br><font color=green>www.telegraph.co.uk/ - 32k - <A HREF=/search?q=cache:www.telegraph.co.uk/+telegraph class=l>Cached</A> - <A HREF=/search?lr=&safe=off&num=10&q=related:www.telegraph.co.uk/ class=l>Similar pages
In bold is the data we wish to extract. What we need to do is write a set of prefixes and terminators, similar to how we did for rows, that will extract exactly the data we want. The way to think of this is sequentially. TESS will search through the row for the first column you specify. Once it's found that coloumn it adds it into the result set and keep searching for the next column in the rest of row text. It ignores anything before, or contained within, previous columns.
So what should we use in our example. It would appear that the name of the site is prefixed by a link to it. We can exploit this fact in writing our name_prefix. Since all the varies in a url tag is the url itself we can just write a url matcher:
<A HREF=[^>]*>should work just fine. (This looks for <A HREF= followed by anything that's not a greater than sign, followed by a greater than sign which is exactly what a link tag is.)
The termination is easy, it's just the end of the link tag, </A> So now we can write our name prefixes and terms:
name_prefix=<A HREF=[^>]*> name_term=</A>Tess will now continue searching for the description field in the rest of the row. It seems to be a safe bet that the description will start with the text Description: and end a with a <br&gr;. Similarly we can grab the url by using font color=green> as the prefix and a - as the term. So now our wrapper looks like:
host=www.google.com url=/search? method=get arguments=q, default1= columns=name,description,url, prefix=seconds rowprefix=^<p> rowterm=Similar pages name_prefix=<A HREF=[^>]*> name_term=</A> description_prefix=Description: description_term=<br> url_prefix=font color=green> url_term=-We're almost done! If you look at all the results you will notice that sometimes there is no description of the site. If we were just to leave the wrapper as is TESS would complain because by default it expects to always find all columns. We have to tell it that in this case it's okay for the description to be missing. To do this we use the nullifmissing directive.
description_nullifmissingWe also need to tell tess where it can stop looking for more rows with the termination tag. By looking at the end of the results that google returned:
<p><A HREF=http://www.nbnews.com/telegraphjournal/>The New Brunswick <b>Telegraph</b> Journal</A><font size=-1><br> <b>...</b> Limbering up. Noel Chenier/ <b>Telegraph</b>-Journal Adam Mitton is framed by the legs<br> of a teammate as he stretches during a running practice in preparation for the <b>...</b> <br><font color=green>www.nbnews.com/telegraphjournal/ - 16k - <A HREF=/search?q=cache:www.nbnews.com/telegraphjournal/+telegraph class=l>Cached</A> - <A HREF=/search?lr=&safe=off&num=10&q=related:www.nbnews.com/telegraphjournal/ class=l>Similar pages</A></font></font><br><p><center> <div class=nav> <p><table border=0 cellpadding=0 width=10% cellspacing=0><tr align=center valign=top><td valign=bottom nowrap><font face=arial,sans-serif size=-1>Result Page: </font></td> <td><IMG SRC=/intl/en_extra/nav_first.gif alt=""></td> <td><IMG SRC=/intl/en_extra/nav_current.gif alt=""><br><span class=nav><font color=#A90A08>1</font></span></td>it seems that <div class=nav> seems to be a good indicator that the results are finished. As such we will add
termination=<div class=nav>to our wrapper. We also need to tell TESS that the wrapper is finished by adding an "end" at the end.
So the completed wrapper looks like:
host=www.google.com url=/search? method=get arguments=q, default1= columns=name,description,url, prefix=seconds rowprefix=^<p> rowterm=Similar pages name_prefix=<A HREF=[^>]*> name_term=</A> description_prefix=Description: description_term=<br> url_prefix=font color=green> url_term=- description_nullifmissing termination=<div class=nav> endCongratulations, you've just written your first TESS wrapper!