
Perl Module Monday: HTTP::Tiny

October 17th, 2011 | 2 Comments | Posted in CPAN, HTTP, Perl

I’m still deep in the Stanford AI class, so this will be a light-weight posting. And since it’s going to be light-weight anyway, I’ll cover a module in the *::Tiny namespace: HTTP::Tiny.

HTTP::Tiny is a simple HTTP/1.1 client library with plenty of options. It handles HTTPS (if you have IO::Socket::SSL available) as well as plain HTTP requests, and supports all the basic HTTP verbs. As is the case with most *::Tiny modules, the goal is to do as much as possible without the overhead or dependency chain of a larger module. In this case HTTP::Tiny stands in as a replacement for LWP::UserAgent, for those cases when you don’t need the full functionality that LWP provides.

The main methods of HTTP::Tiny that you’re likely to use (besides the constructor) are request() and get(), which is just a front-end to request() with the ‘method’ argument set to GET. There is also a method called mirror(), which is handy for making a local copy of a web resource on your filesystem. If the file already exists, mirror() even sets an “If-Modified-Since” header on the request, which is a nice touch. The request() method accepts a very useful range of options that make it easy to pass specific headers, use callback subroutines for the request body, the processing of the response, or both, and provide trailer headers for chunked transfer-encoding. One thing I find curious, though, is why the author provides a shorthand method for the GET request, but not for the other verbs. Since all are called using the same semantics, it seems to me that it would have made as much sense to provide head(), put(), etc.
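To make the request() options a bit more concrete, here is a minimal sketch. The helper names head_via_request() and count_body_bytes() are my own inventions for illustration, not part of HTTP::Tiny, and the Accept header is just an arbitrary example. The first helper shows how a shorthand for another verb could wrap request() the same way get() does, and the second uses the data_callback option to process the response body chunk by chunk.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical shorthand for HEAD, built on request() the same way
# get() is described as wrapping a GET request.
sub head_via_request {
    my ($http, $url, $options) = @_;
    return $http->request('HEAD', $url, $options || {});
}

# Hypothetical helper: stream the response body through a data_callback
# and return the number of bytes seen plus the response hash reference.
sub count_body_bytes {
    my ($http, $url) = @_;
    my $bytes = 0;
    my $response = $http->request('GET', $url, {
        headers       => { 'Accept' => 'text/html' },
        data_callback => sub {
            my ($chunk, $partial_response) = @_;
            $bytes += length $chunk;
        },
    });
    return ($bytes, $response);
}

# Example run (only when a URL is given): perl script.pl http://example.com/
if (@ARGV) {
    my $http = HTTP::Tiny->new();
    my $head = head_via_request($http, $ARGV[0]);
    print "HEAD: $head->{status} $head->{reason}\n";
    my ($bytes, $response) = count_body_bytes($http, $ARGV[0]);
    print "GET:  $response->{status}, $bytes bytes\n";
}
```

Both helpers take the client object as an argument, so they work with anything that provides a compatible request() method.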

Still, it’s a nice little approach to HTTP communication that doesn’t require as much setting-up of resources as LWP generally does. It doesn’t have the flexibility that LWP does, either, but sometimes you just don’t need that. You just need to get going in a few lines:

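use strict;
use warnings;
use HTTP::Tiny;

my $http = HTTP::Tiny->new();

for my $url (@ARGV)
{
    # Use everything after the last slash as the local file name
    (my $file = $url) =~ s{^.*/}{};
    if (! length($file))
    {
        warn "Skipping $url (no file component)\n";
        next;
    }
    # mirror() returns a response hash reference; report any failure
    my $response = $http->mirror($url, $file);
    warn "Failed to mirror $url: $response->{status} $response->{reason}\n"
        unless $response->{success};
}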

The above just mirrors all the URLs passed in via @ARGV, using the last path segment of each URL as the name of the file to save to. It doesn’t have the progress-bar and summary that LWP’s “lwp-download” has, but it gets the job done.
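One wrinkle with taking everything after the last slash is that any query string or fragment ends up in the file name. A small sketch of one way around that follows. The helper name filename_from_url() is hypothetical, and whether you want to drop the query string at all is an assumption about your use case.

```perl
use strict;
use warnings;

# Hypothetical helper: derive a local file name from a URL.
# Returns undef when nothing follows the last slash.
sub filename_from_url {
    my ($url) = @_;
    (my $path = $url)  =~ s{[?#].*\z}{}s;   # drop query string and fragment
    (my $file = $path) =~ s{^.*/}{}s;       # keep what follows the last slash
    return length($file) ? $file : undef;
}

# Example run: print the file name each URL on the command line maps to
for my $url (@ARGV) {
    my $file = filename_from_url($url);
    print "$url => ", (defined $file ? $file : '(no file component)'), "\n";
}
```

The result can be passed straight to mirror() in place of the inline substitution.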

So have a look. This could be a useful addition to your toolkit, sitting beside LWP and handling some of the simpler tasks for it.


No PMM This Week

October 10th, 2011 | No Comments | Posted in Perl

Alas, I didn’t get this done earlier in the day, and now I need to spend the remainder of my evening working on the first units in the Stanford on-line AI class. These materials were only just posted, but I’m already behind the curve because I’ve not reviewed all the pre-class material. Hopefully I’ll be able to get a PMM candidate picked out for next week and get the post written before it gets this late in the day.


Perl Module Monday: IMDB::Film

October 3rd, 2011 | 2 Comments | Posted in CPAN, Perl

For this week’s PMM, I’m going to go with something a little more fun: the IMDB::Film module. Though, to be fair, I’ll be offering it up with some caveats and reservations.

Still, I’m a huge fan of movies; I try to see a new film every week or two, and my DVD collection has out-grown two different shelves. I’ve even gone so far as to get an Android app on my phone (Packrat) for the sole purpose of keeping track of my collection so that I don’t impulse-buy something I already have (usually because I’ve found it on sale). And don’t get me started on slowly replacing my most-favorite films with Blu-Ray copies! Anyway, I’ve also been a huge fan of the IMDb web site since it first got its start. But they don’t offer an API to their data (which I find strange, given their huge reliance on open-source software and user-generated content). Until and unless they see the error of their ways, we’ll have to get by with modules like IMDB::Film, which does a lot of the heavy-lifting when it comes to screen-scraping IMDb.

The IMDB::Film class (and the companion IMDB::Persons class) handles all the page-fetching and parsing that you would otherwise have to do, and presents you with a reasonably-encapsulated object representing an IMDb film (or person). Based on the criteria you give it, it either goes directly to the necessary page, or it does a search and returns you the first matching record (along with enough additional information to get the remaining matched records). For example, the snippet here:

use IMDB::Film;

my $film = IMDB::Film->new(crit => 'Harry Potter');

Here, $film holds the first match, “Harry Potter and the Sorcerer’s Stone”. Calling $film->matched() gives you an array-reference to the 43 (!) total matches for the string “Harry Potter”. Each of those 43 slots holds a hash-reference that includes the IMDb key for the given title, meaning you can fetch the other titles without first going to the search form:

my $other_film = IMDB::Film->new(crit => $film->matched->[0]->{id});

This will go directly to that page and fill in $other_film with the info from it. Read the docs for the class to see the other accessors you can call, and see the docs for the IMDB::Persons class for what you can do with it. In particular, the cast() method on a film object will give you a list-reference of hash-references, one key of which is the IMDb ID for each cast member. You can use this to get their page info with IMDB::Persons.
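Putting those pieces together, a rough sketch might look like the following. The helpers describe_matches() and cast_ids() are hypothetical names of mine. I am also assuming that each matched() entry carries a ‘title’ key next to ‘id’, and that IMDB::Persons objects have a name() accessor, so check the docs before relying on either. The modules are only loaded when a search string is passed on the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: one "id: title" line per search match.
# Assumes each match hash has 'id' and 'title' keys.
sub describe_matches {
    my ($film) = @_;
    my $matches = $film->matched() || [];
    return map { "$_->{id}: $_->{title}" } @$matches;
}

# Hypothetical helper: the IMDb IDs of the cast members returned by cast()
sub cast_ids {
    my ($film) = @_;
    my $cast = $film->cast() || [];
    return map { $_->{id} } @$cast;
}

# Example run: perl script.pl 'Harry Potter'
if (@ARGV) {
    require IMDB::Film;
    require IMDB::Persons;
    my $film = IMDB::Film->new(crit => $ARGV[0]);
    print "$_\n" for describe_matches($film);
    my ($first_id) = cast_ids($film);
    if (defined $first_id) {
        # name() is an assumed accessor on the person object
        my $person = IMDB::Persons->new(crit => $first_id);
        print "First listed cast member: ", $person->name(), "\n";
    }
}
```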

Now, the dreaded caveats and reservations:

  • The current version (0.51 as of this writing) has left some debugging lines in the code, so calls to new() (in both the ::Film and ::Persons classes) send cruft to STDOUT.
  • And, by the way, why call one class “Film” (singular) and the other class “Persons” (plural)? I consider that bad design.
  • The cast() method only lists the cast that are listed on the main page of the film’s IMDb entry. In the Harry Potter example, this means only the first 15 people, most of whom are actually minor players.
  • In general, there seems to be no deeper-drilling for any information: you can get the short bio for an actor, but not the full bio, for example.
  • You can get URLs for certain of the data elements (images, etc.), but not for the full page itself. If I wanted to extract data for Tom Cruise, for example, then render that data along with a link back to the IMDb page for him, I cannot get that URL from the IMDB::Persons record for Tom Cruise. This despite the fact that it had to have fetched that URL to get the data.
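As a stop-gap for the missing page URL, you can rebuild it yourself from the IMDb ID. This is only a sketch: imdb_url() is a hypothetical helper, and the URL layout (a ‘tt’ prefix under /title/ and an ‘nm’ prefix under /name/) is my assumption about the IMDb site rather than anything the module provides or guarantees.

```perl
use strict;
use warnings;

# Hypothetical helper: build an IMDb page URL from an ID.
# $kind is 'title' for films or 'name' for people; the URL layout is assumed.
sub imdb_url {
    my ($kind, $id) = @_;
    my %prefix_for = (title => 'tt', name => 'nm');
    my $prefix = $prefix_for{$kind}
        or die "Unknown kind '$kind' (expected 'title' or 'name')\n";
    $id =~ s/^\Q$prefix\E//;    # accept IDs with or without the prefix
    return "https://www.imdb.com/$kind/$prefix$id/";
}

# Example run: perl script.pl name 0000001
print imdb_url(@ARGV[0, 1]), "\n" if @ARGV >= 2;
```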

There are other minor nits, but those are the high points. I will be watching this module to see if any of these get addressed. I opened an RT ticket for the errant debugging messages, so hopefully that will be fixed in the next release. But while I may seem to be harsh on it, I still think it’s a useful little module, and worth playing around with. Scraping IMDb is no small task, and I’m glad someone is doing the grunt-work of keeping up with their content-layout changes.
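Until the debugging output is dealt with, one possible workaround is to point STDOUT at the null device while the constructor runs. The quietly() helper below is a hypothetical name of mine, and the sketch assumes STDOUT is an ordinary file handle that can be duplicated.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: run a code reference with STDOUT sent to the null
# device, then restore STDOUT and hand back whatever the code returned.
sub quietly {
    my ($code) = @_;
    open(my $saved, '>&', \*STDOUT) or die "Cannot dup STDOUT: $!";
    open(STDOUT, '>', File::Spec->devnull()) or die "Cannot redirect STDOUT: $!";
    my @result = eval { $code->() };
    my $error  = $@;
    open(STDOUT, '>&', $saved) or die "Cannot restore STDOUT: $!";
    die $error if $error;
    return wantarray ? @result : $result[0];
}

# Example run: perl script.pl 'Harry Potter'
if (@ARGV) {
    require IMDB::Film;
    my $film = quietly(sub { IMDB::Film->new(crit => $ARGV[0]) });
    print "Fetched: ", $film->title(), "\n";
}
```

Anything the wrapped code prints is discarded, so only use this around calls whose output you really do not want.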
