Screen Scraping for RSS

I like to read online comics. Unfortunately, some of them do not publish RSS feeds, which is a real shame. I ranted about this on Monday. But hey, if they don’t make one, I will do it for them.

I wrote a nice little perl script that screen scrapes a page for an image, and then generates an RSS feed. It requires the WWW::Mechanize and XML::RSS modules, which can be downloaded from CPAN or some other repository.
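If you don’t have those two modules installed yet, something along these lines should pull them in from CPAN (the exact client invocation may differ depending on your Perl setup):

```shell
# Install the two required modules from CPAN
# (invocation may vary with your Perl installation)
cpan WWW::Mechanize XML::RSS
```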

#    This program is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <http://www.gnu.org/licenses/>.

use WWW::Mechanize;
use XML::RSS;
use strict;
use warnings;

die("Usage: grab.pl url pattern") if(scalar @ARGV != 2);

my $mech = WWW::Mechanize->new();

my $url = $ARGV[0];
my $match = $ARGV[1];

# get() returns a response object that is always true, so check success explicitly
$mech->get( $url );
die("poop") unless $mech->success;

my $img = $mech->find_image(url_regex => qr/$match/i);
die("No image matching '$match' found on $url") unless defined $img;

my $item;

if(substr($img->url, 0, 4) eq "http")
{
	$item = $img->url;
}
else
{
	$item = $url . $img->url;
}

my $img_tag = '<img src="' . $item . '">';

my $feedname = substr($url, 7);
$feedname =~ s/\///g;

my ($Second, $Minute, $Hour, $Day, $Month, $Year, $WeekDay, $DayOfYear, $IsDST) = localtime(time);
$Year += 1900;

my @days = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun");
my @mon = ("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

$WeekDay = $days[$WeekDay];
$Month = $mon[$Month];

my $date = "$WeekDay, $Day $Month $Year, $Hour:$Minute:$Second EST";

my $rss = XML::RSS->new(version => '2.0');
$rss->channel(title          => $feedname,
              link           => $url,
              language       => 'en',
              description    => 'Comic Feed',
              rating         => '(PICS-1.1 "http://www.classify.org/safesurf/" 1 r (SS~~000 1))',
              copyright      => "Copyright $url",
              pubDate        => $date,
              lastBuildDate  => $date,
              #docs           => 'http://www.blahblah.org/fm.cdf',
              managingEditor => 'maciakl1@mail.montclair.edu',
              webMaster      => 'maciakl1@mail.montclair.edu'
              );

$rss->add_item(title => "Cartoon for " . $date,
        permaLink  => $item,
        enclosure   => { url=>$item, type=>"application/x-data" },
        description => $img_tag
);


$rss->save("$feedname.xml");

How does it work? You simply call it with:

perl grab.pl url pattern

Where url is the URL of your web comic, and pattern is some string that is unique to the URL of the actual comic image. For example, extralife is easy because the front page image is always current.gif (you can use this as a pattern). DorkTower, on the other hand, uses variable image names, but all the pictures are stored in the /comics/dorktower/images/comics/ directory. Furthermore, none of the advertisement or background images are stored in a dir called comics – so I picked “comics” as a pattern.
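So for the two examples above, the calls would look something like this (the URLs here are just placeholders – use the comic’s actual front page):

```shell
# Hypothetical invocations matching the examples above
perl grab.pl http://www.myextralife.com/ current.gif
perl grab.pl http://www.dorktower.com/ comics
```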

Essentially, you have to look closely at the code of the page you are scraping once, and pick a good pattern. The feed is created in the same directory as the script. To generate the file name I drop the http:// part from the url, remove all the slashes and append .xml at the end. I could add another optional argument to specify the feed name, but I don’t really care about it. Feel free to do it yourself.
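The name mangling boils down to this (shown here in shell for illustration, with a hypothetical comic URL):

```shell
url="http://www.dorktower.com/"                   # hypothetical comic URL
feedname="${url#http://}"                         # drop the http:// part
feedname=$(printf '%s' "$feedname" | tr -d '/')   # remove all the slashes
echo "${feedname}.xml"                            # -> www.dorktower.com.xml
```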

Just a side note: if you plan on running this on Windows with ActiveState Perl and you use ppm for your module management, make sure you get WWW::Mechanize 1.4 or higher. The 0.72 package that can be downloaded from the ActiveState repository does not support the find_image function I’m using.

You might want to add http://theoryx5.uwinnipeg.ca/ppms/ to the ppm repository list. You can download a more recent version from there.




4 Responses to Screen Scraping for RSS

  1. Wow, this is exactly what I was looking for the other day! Some comics include the comic inside the feed. Many don’t, so I can use this to make my improved feed.

  2. Your original source link is broken now, so the HTML version is the only one left. Oh, any chance you could license it to me? :-D Like GPL or BSD or whatever?

  3. Luke Maciak says:

    Ok, I slapped GPL boilerplate on top of it. Feel free to use it in any GPL compatible ways you wish. :)

  4. First of all, the week begins with Sunday. ;-)

    And maybe this wasn’t true back then, but you can use ->url_abs() in place of ->url() so you don’t have to manually build the absolute one. So you can replace,

    my $item;
     
    if(substr($img->url, 0, 4) eq "http")
    {
    	$item = $img->url;
    }
    else
    {
    	$item = $url . $img->url;
    }
     

    with,

    my $item = $img->url_abs;
    

    This will also allow it to properly handle redirects.

