O'Reilly Book Excerpts: Spidering Hacks

Spidering Hacks

by Morbus Iff and Tara Calishain

Editor's note: This week we offer two hacks from Spidering Hacks that will save you time and extra trips to your favorite web sites. And check back to this space next week for two more hacks from the book; the first will be on scraping all the URLs in a specified subcategory of the Yahoo directory; the second will be on using a bit of Perl to quickly find the word you're looking for in either an online dictionary or thesaurus.

Hack #24: Painless RSS with Template::Extract

Wouldn't it be nice if you could simply visualize what the data on a page looks like, explain it in template form to Perl, and not bother with parsers, regular expressions, and other programmatic logic? That's exactly what Template::Extract helps you do.

One thing that I'd always wanted to do, but never got around to doing, was produce RSS files for all those news sites I read regularly that don't have their own RSS feeds. Maybe I'd read them more regularly if they notified me when something was new, instead of requiring me to remember to check.

One day, I was fiddling about with the Template Toolkit (http://www.template-toolkit.com/) and it dawned on me that all these sites were, at some level, generated with some templating engine. The Template Toolkit takes a template and some data and produces HTML output. For instance, if I have the following Perl data structure:

@news = (
        { date => "2003-09-02", subject => "Some News!",
          content => "Something interesting happened today." }, 
        { date => "2003-09-03", subject => "More News!",
          content => "I ran out of imagination today." } 
);

I can apply a template like so:

<ul>
    [% FOREACH item = news %]
        <li> <i> [% item.date %] </i> - <b> [% item.subject %] </b>
            <p> [% item.content %] </p>
        </li>
    [% END %]
</ul>

I'll end up with some HTML that looks like this:

<ul>
        <li> <i> 2003-09-02 </i> - <b> Some News! </b>
            <p> Something interesting happened today. </p>
        </li>
        <li> <i> 2003-09-03 </i> - <b> More News! </b>
            <p> I ran out of imagination today. </p>
        </li>
</ul>
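
For the curious, here's roughly what that forward pass looks like as code: a minimal sketch using the Template module's process method, with the data and template above pasted in verbatim (the heredoc variable names are mine, not from the book):

#!/usr/bin/perl
use strict;
use warnings;
use Template;

my @news = (
        { date => "2003-09-02", subject => "Some News!",
          content => "Something interesting happened today." },
        { date => "2003-09-03", subject => "More News!",
          content => "I ran out of imagination today." }
);

my $template = <<'END_TEMPLATE';
<ul>
    [% FOREACH item = news %]
        <li> <i> [% item.date %] </i> - <b> [% item.subject %] </b>
            <p> [% item.content %] </p>
        </li>
    [% END %]
</ul>
END_TEMPLATE

# process() takes a template (here, a reference to a string) and the data,
# and writes the rendered HTML to STDOUT by default.
my $tt = Template->new;
$tt->process(\$template, { news => \@news }) or die $tt->error;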

Okay, you might think, very interesting, but how does this relate to scraping web pages for RSS? Well, we know what the HTML looks like, and we can make a reasonable guess at what the template ought to look like, but we want only the data. If only I could apply the Template Toolkit backward somehow. Taking HTML output and a template that could conceivably generate the output, I could retrieve the original data structure and, from then on, generating RSS from the data structure would be a piece of cake.

Like most brilliant ideas, this is hardly original, and an equally brilliant man named Autrijus Tang not only had the idea a long time before me, but—and this is the hard part—actually worked out how to implement it. His Template::Extract Perl module (http://search.cpan.org/author/AUTRIJUS/Template-Extract/) does precisely this: extract a data structure from its template and output.

I put it to work immediately to turn the blog of one of my favorite singers, Martyn Joseph (http://www.piperecords.co.uk/news/diary.asp), into an RSS feed. I'll use his blog for the example in this hack.

First, write a simple bit of Perl to grab the page, and tidy it up to avoid tripping over whitespace issues:

#!/usr/bin/perl
use LWP::Simple qw(get);

my $page = get("http://www.piperecords.co.uk/news/diary.asp");
exit unless $page;
$page = join "\n", grep { /\S/ } split /\n/, $page;
$page =~ s/\r//g;
$page =~ s/^\s+//g;

This removes blank lines, DOS carriage returns, and leading spaces. Once you've done this, take a look at the structure of the page. You'll find that blog posts start with this line:

<!--START OF ABSTRACT OF NEWSITEM-->

and end with this one:

<!--END OF ABSTRACT OF NEWSITEM-->

The interesting bit of the diary starts after the close of an HTML comment:

-->

After a bit more observation, you can glean a template like this:

-->
[% FOR records %]
    <!--START OF ABSTRACT OF NEWSITEM-->
    [% ... %]
    <a href="[% url %]"><acronym title="Click here to read this article">
    [% title %]</acronym></a></strong> &nbsp; &nbsp; ([% date %]) <BR>
    [% ... %]<font size="2">[% content %]</font></font></div>
    [% ... %]
    <!--END OF ABSTRACT OF NEWSITEM--> 
[% END %]

The special [% ... %] template markup means "stuff," or things we don't care about; it's the Template::Extract equivalent of a regular expression's .*. Now, feed your document and this template to Template::Extract:

# $template holds the template shown above; $page is the tidied page.
my $x = Template::Extract->new(  );
my $data = $x->extract($template, $page);

You end up with a data structure that looks like this:

$data = { records => [
         { url => "...", title => "...", date => "...", content => "..." },
         { url => "...", title => "...", date => "...", content => "..." },
           ...
        ]};

The XML::RSS Perl module [Hack #94] can painlessly turn this data structure into a well-formed RSS feed:

$rss = new XML::RSS;
$rss->channel( title => "Martyn's Diary", 
               link => "http://www.piperecords.co.uk/news/diary.asp",
               description => "Martyn Joseph's Diary" ); 

for (@{$data->{records}}) {
        $rss->add_item(
            title => $_->{title},
            link => $_->{url},
            description => $_->{content}
        );
} 

print $rss->as_string;

Job done—well, nearly.

You see, it's a shame to have solved such a generic problem—scraping a web page into an RSS feed—in such a specific way. Instead, what I really use is the following CGI driver, which allows me to specify all the details of the site and the RSS in a separate file:

#!/usr/bin/perl -T
use Template::Extract;
use LWP::Simple qw(get);
use XML::RSS;
use CGI qw(:standard);
print "Content-type: text/xml\n\n";
my $x = Template::Extract->new(  );
my %params; 

path_info(  ) =~ /(\w+)/ or die "No file name given!";
open IN, "rss/$1" or die "Can't open rss/$1: $!";

# Read "key: value" header lines until the first blank line.
while (<IN>) { /(\w+): (.*)/ and $params{$1} = $2; last if !/\S/; } 

my $template = do {local $/; <IN>;};
$rss = new XML::RSS;
$rss->channel( title => $params{title}, link => $params{link},
               description => $params{description} );

my $doc = join "\n", grep { /\S/ } split /\n/, get($params{link});
$doc =~ s/\r//g;
$doc =~ s/^\s+//g;

for (@{$x->extract($template, $doc)->{records}}) {
    $rss->add_item(
        title => $_->{title},
        link => $_->{url},
        description => $_->{content}
    );
}

print $rss->as_string;

Now I can have a bunch of files that describe how to scrape sites:

title: Martyn's Diary
link: http://www.piperecords.co.uk/news/diary.asp
description: Martyn Joseph's diary 
--> 
[% FOR records %]
    <!--START OF ABSTRACT OF NEWSITEM-->
    [% ... %]
    <a href="[% url %]"><acronym title="Click here to read this article"> 
    [% title %]</acronym></a></strong> &nbsp; &nbsp; ([% date %]) <BR>
    [% ... %]<font size="2">[% content %]</font></font></div>
    [% ... %]
    <!--END OF ABSTRACT OF NEWSITEM--> 
[% END %]

When I point my RSS aggregator at the CGI script (http://blog.simon-cozens.org/rssify.cgi/martynj), I have an instant scraper for all those wonderful web sites that haven't made it into the RSS age yet.

Template::Extract is a brilliant new way of doing data-directed screen scraping for structured documents, and it's especially brilliant for anyone who already uses Template to turn templates and data into HTML. Also look out for Autrijus's latest crazy idea, Template::Generate (http://search.cpan.org/author/AUTRIJUS/Template-Generate/), which provides the third side of the Template triangle, turning data and output into a template.

Simon Cozens

Hack #37: Downloading Comics with dailystrips

Love comics but hate visiting multiple sites for your daily dose? Automate your stripping with some easy-to-use open source Perl software.

It's hard to believe that, across all the cultures of the Internet, there's one common denominator of humor. Can you guess what it is? No, no; it's not the "All Your Base Are Belong to Us" videos. It's the comic strip. Whether you're into geek humor, political humor, or unfortunate youngsters forever failing to kick a football, there's a comic strip for you.

In fact, there may be several comic strips for you. There may be so many that it's a pain to visit all the sites containing said comic strips to view them. But there's a great piece of software available to ease your woes: dailystrips grabs all the strips for you, presenting them in one HTML file. Combine it with cron [Hack #90] and you've got a great daily comic strip supplement right in your mailbox or web site. The author, Andrew Medico, makes it clear that if you set this up to run on a web site, you must ensure that you've configured your site to restrict access to you alone or risk some legal consequences.

Getting the Code

dailystrips is available at http://dailystrips.sourceforge.net/, and this hack covers Version 1.0.27. There are two components to the program: the program itself and the definitions file, which defines the details of the available comic strips. As of this writing, dailystrips supports over 500 different comic strips. Once you've downloaded the program, go back to the download page and grab the latest definitions file, which is updated often. Save it over the strips.def file that comes packaged in the ZIP archive with the application.

Running the Hack

After installation (see the INSTALL file or installation instructions online at http://dailystrips.sourceforge.net/1.0.27/install.html), dailystrips runs from the command line with several options. Here are a few of the more important ones:

--list

Lists available strips

--random

Downloads a random strip

--defs filename

Uses a user-specified strips definition file

--local

Saves strips to a local HTML file rather than the default of STDOUT

--help

Prints a list of available options

To grab the latest "Get Fuzzy" comic and save to a local file, run:

% perl dailystrips --local getfuzzy

While the program is running, you'll get a count of any errors in retrieving the images of the strips. From my experiments, it looked like the nonsyndicated comics were easier to get and more consistent than the syndicated ones.

Once the program is finished, it'll either print some HTML to STDOUT or, if you've specified the --local option, save the strips to an HTML file named with the current date. The file is saved into the dailystrips directory.
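
Coming back to the cron idea mentioned earlier, a crontab entry along these lines would regenerate the dated HTML file each morning. This is only a sketch: the 6:30 a.m. schedule, the ~/dailystrips path, and the choice of the getfuzzy strip are assumptions you'd adjust for your own setup.

# Fetch the latest strips at 6:30 each morning and write the dated HTML file.
30 6 * * * cd $HOME/dailystrips && perl dailystrips --local getfuzzy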

Hacking the Hack

In this hack, we're not hacking the hack so much as hacking the defs file. The defs file defines where each strip is retrieved from and the code snippets used to retrieve it. The defs file also includes groups, which are shortcuts for retrieving several comics at once. More extensive information on how to define strips is available in the README.DEFS file.

Defining strips by URL

The first way to define new strips is by generating a URL based on the current date. Here's an example for James Sharman's "Badtech" comic:

strip badtech
    name Badtech
    artist James Sharman
    homepage http://www.badtech.com/
    type generate
    imageurl http://www.badtech.com/a/%-y/%-m/%-d.jpg
    provides any
end

The first line specifies a unique strip name that you'll use to add the strip to a group or get it from the command line. The second line, name, specifies the name of the strip to display in the HTML output. Next, artist includes the name of the illustrator, which will also display in the HTML output. The fourth line determines the home page of the strip, and the fifth line specifies how the strip is found. In this case, we're generating a URL. imageurl specifies the URL of the comic, and %-y, %-m, and %-d specify the year, month, and day, respectively.

The final line, provides, indicates which types of strips the definition can provide: either any for a definition that can provide the strip for any given date, or latest for a definition that can provide only the current strip.
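
To make the date codes concrete, here's a rough illustration of how such a pattern might expand. This is not dailystrips' own code, and it assumes %-y, %-m, and %-d mean the two-digit year, month, and day with no leading zeros:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical expansion of a generate-type imageurl pattern.
my $pattern = 'http://www.badtech.com/a/%-y/%-m/%-d.jpg';

my ($mday, $mon, $year) = (localtime)[3, 4, 5];
my %code = (
    '%-y' => ($year + 1900) % 100,  # two-digit year, no leading zero (assumed)
    '%-m' => $mon + 1,              # month, no leading zero
    '%-d' => $mday,                 # day of month, no leading zero
);

(my $url = $pattern) =~ s/(%-[ymd])/$code{$1}/g;
print "$url\n";   # e.g., http://www.badtech.com/a/3/9/2.jpg on 2003-09-02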

Finding strips with a search

The other type of URL generation, searching, is as follows:

strip joyoftech
    name The Joy of Tech
    homepage http://www.joyoftech.com/joyoftech/
    type search
    searchpattern <IMG.+?src="(joyimages/\d+\.gif)\"
    matchpart 1
    baseurl http://www.joyoftech.com/joyoftech/
    provides latest
end

Notice that the options are similar to those in the previous example. The strip, name, and homepage options function as they do in the first example, but the type option is now search. With this type, you need to include a searchpattern, which specifies a Perl regular expression that matches the strip's URL. The matchpart line tells the script which parenthetical (capturing) section of the pattern to use. In this example, there's only one parenthetical section.

baseurl is necessary only when the searchpattern line does not match a full URL (as in this instance). When specified, it's prepended to whatever the regular expression of searchpattern matches.
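
Conceptually, a search-type definition boils down to fetching the home page, applying the regular expression, and gluing baseurl onto the captured match. A rough sketch of that logic (again, not dailystrips' internal code) might look like this:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Values taken from the joyoftech definition above.
my $homepage      = 'http://www.joyoftech.com/joyoftech/';
my $baseurl       = 'http://www.joyoftech.com/joyoftech/';
my $searchpattern = qr{<IMG.+?src="(joyimages/\d+\.gif)"};

my $html = get($homepage) or die "Couldn't fetch $homepage\n";
if ($html =~ $searchpattern) {
    my $strip_url = $baseurl . $1;   # matchpart 1: the first capture group
    print "Latest strip image: $strip_url\n";
} else {
    print "No match; the page layout may have changed.\n";
}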

Gathering strips into a group

If you want to get a set of the same comic strips every day, it's kind of a pain to type them all in. dailystrips lets you specify a group name that gathers several comic strips at the same time. Groups go at the top of the definitions file and look like this:

group andrews
    desc Andrew's Favorite Strips
    include userfriendly dilbert foxtrot
    include pennyarcade joyoftech thefifthwave monty bc
    include wizardofid garfield adamathome
end

group is the name of the group, and desc is its descriptive blurb. On each line after that, use the word include and whatever strips you want gathered into the group. As you can see, there are 11 strips in this group. When you're finished, put end on its own line. You call groups of strips with a @, as in this example:

% perl dailystrips -l @andrews

Kevin Hemenway is the coauthor of Mac OS X Hacks, author of Spidering Hacks, and the alter ego of the pervasively strange Morbus Iff, creator of disobey.com, which bills itself as "content for the discontented."

Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.

