Recently in Perl Category

Scraping ASP.NET sites with Perl

Today at work I needed to locate and extract, automatically, some information from a website.

There was no direct URL to the information I needed, some fields had to be filled and some POST forms had to be submitted.

Normally I would use WWW::Mechanize for such a task, but in this particular instance the situation was made somewhat less manageable because the site in question was implemented with ASP.NET.

The problem with this is that every link has an associated JavaScript event handler which does some housekeeping, assigns things to funnily named hidden input fields like __EVENTTARGET and __EVENTARGUMENT and then POSTs a form.

My first thought was to try and find a CPAN module which handles those complications. Not surprisingly, there is one, aptly named HTML::TreeBuilderX::ASP_NET.

According to its documentation, the module works in combination with the standard LWP::UserAgent and HTML::TreeBuilder, and converts ASP.NET JavaScript posting redirects into an HTTP::Request object which can be fed to LWP::UserAgent’s request() method. Just what the doctor ordered.

However, it turned out that my joy was a bit premature:

  • it requires Perl 5.10, which we do not yet have on our production systems;
  • its documentation is incomplete and at times inaccurate - it insists on calling its httpRequest() method httpResponse();
  • it fails its own tests, not only on the two machines where I tried to run them, but also on a lot of other systems according to CPAN Testers.

After a bit of pondering I decided that spending time on trying to fix the HTML::TreeBuilderX::ASP_NET module is a bit counter-productive - I needed the working code soon.

So what to do?

One thing we should keep in mind is that those JavaScript postbacks do not do anything fancy. The hidden fields that are filled in depend on what was clicked on the page, nothing else. After they are filled, a normal POST occurs.
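To make this concrete: ASP.NET links usually carry an href of the form javascript:__doPostBack('target','argument'), and those two arguments are what end up in __EVENTTARGET and __EVENTARGUMENT. Below is a minimal sketch of pulling them out of such a link. The helper name parse_postback and the sample control name are made up for illustration, and the regex assumes plain single-quoted arguments without escaped quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the two __doPostBack() arguments out of a link.
# Returns (target, argument), or an empty list if the href is not a postback.
sub parse_postback {
    my ($href) = @_;
    return unless defined $href;
    return unless $href =~ /__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/;
    return ($1, $2);
}

# Sample href; the control name is invented.
my $href = "javascript:__doPostBack('ctl00\$Main\$lnkNext','')";
if (my ($target, $argument) = parse_postback($href)) {
    print "__EVENTTARGET   = $target\n";
    print "__EVENTARGUMENT = $argument\n";
}
```

These are the two values to watch for when capturing what the browser actually POSTs.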

So if we know what to POST, we could just use WWW::Mechanize and get the job done easily and quickly.

So the solution naturally splits into two parts - finding out what fields to set, and automating the process.

The first part is to launch a browser, do clicking and entering by hand, and capture what gets POSTed at each step. This capturing could be done by a variety of methods:

  • tcpdump/wireshark - listen to ‘em on the wire!
  • having a proxy which outputs the POSTed parameters;
  • using a browser extension that shows POSTed parameters.

I chose the second option, since I already had a script similar to what I needed, and since it makes it easy to filter out parameters I did not want to see, like __VIEWSTATE, which can easily be several kilobytes long.

Enter spyproxy.pl:

#! /usr/bin/perl
use strict;
use warnings;
use HTTP::Proxy;
use CGI;

my $proxy = HTTP::Proxy->new(host => "localhost");
$proxy->logmask(32); # 32 - FILTERS
$proxy->push_filter(
        request => Spy::BodyFilter->new(),
);
$proxy->start;

package Spy::BodyFilter;
use base qw(HTTP::Proxy::BodyFilter);

sub will_modify { 0 }

sub filter
{
    my ($me, undef, $req) = @_;
    print $req->method, " ", $req->uri, "\n";
    return unless $req->method eq "POST";
    my $body = $req->content;
    my $q = CGI->new($body);
    for my $p ($q->param) {
        next if $p eq "__VIEWSTATE";
        print "$p\n\t", $q->param($p), "\n";
    }
}

Launch it locally in a terminal, set your browser’s proxy settings to localhost:8080, and watch the output in the terminal.

The second part of the puzzle is to use the wonderful WWW::Mechanize::Shell. It provides an interactive shell, in which we can issue GET requests, see the content of the responses, view links, forms, and form fields with their values, follow the links, set the value of the fields, click on buttons and submit the forms. Best of all, after getting what we are after we can issue a script command and get a piece of Perl code that will perform all the tasks we’ve just done.

So the final solution looks like this:

  1. Load the start page in your browser (through the spyproxy).
  2. Load the same page in WWW::Mechanize::Shell.
  3. In the browser, fill in any fields that need filling, and click where you want.
  4. Observe the spyproxy output, note any fields that need setting. In a typical ASP.NET application, you will want to ignore the vast majority of the fields at any given moment. Don’t worry, humans are good at this sort of pattern recognition. :-) Pay special attention to __EVENTTARGET and __EVENTARGUMENT fields.
  5. Set the same fields to the same values in the shell (use value fieldname fieldvalue).
  6. If __EVENTTARGET was set, type submit in the shell; otherwise, find the name of the button that was pressed (see step 4), and type click buttonname in the shell;
  7. Examine the content of the response (content in the shell) to make sure that what you’ve got in the shell makes sense.
  8. If more clicking and entering is to be done, go to step 3.
  9. Type script script-name.pl in the shell.
  10. Go edit script-name.pl - remove any prints you do not need, and replace the constants you entered in the fields with variables where needed.
  11. Your custom scraping script is ready to use.
  12. Profit!

I hope this trick will be of use to somebody. Enjoy!

port-tags on github

Some years ago I made a little web application which allowed one to browse the FreeBSD ports collection by tags, à la delicious.

The tags were not created by users but were instead generated from a couple of fields taken from every port’s Makefile, so it was not exactly a “social” software.

There was some limited amount of discussion on FreeBSD mailing lists, and a publicly accessible readonly SVN repository was created by my friend Erwin, but the overall interest was rather low.

Over time I moved on and basically stopped working on the project, but recently I had an idea - not exactly to resurrect it, but to make it easier for interested people to contribute.

Enter port-tags at github. Github is a tool to host git repositories of your open-source projects. Anybody can easily clone your repository, fork it completely, or submit their changes back to you. I only started using it today, so I cannot say much about its features and how convenient they are, but from what I’ve heard, it is very very nice.

So, if you are interested, and have got round tuits to spare, please hack on port-tags - maybe some good will eventually come out of it.

Unknown CPAN III: File::SortedSeek

Let’s suppose that you have a huge logfile and would like to quickly extract lines from it that relate to a given small time interval. How would you do it?

Since the lines are ordered by timestamp, the fastest way (provided you do not keep indexes of any sort) is good old binary search, with all the necessary housekeeping to account for line boundaries and to convert the timestamp from whatever format the logfile uses to epoch seconds for comparison with the target interval boundaries.
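For illustration only, here is roughly what that housekeeping looks like when written by hand. This is a sketch of the general technique, not the implementation of any CPAN module. The function name seek_to_key, the key-extracting callback, and the sample data (lines starting with a plain integer) are all assumptions made for the example.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);

# Position $fh at the first line whose key is >= $target and return that
# byte offset. $key_of is a callback turning a line into a number.
sub seek_to_key {
    my ($fh, $target, $key_of) = @_;

    seek($fh, 0, SEEK_END) or die "seek: $!";
    my ($lo, $hi) = (0, tell($fh));

    # Invariants: every line starting before $lo has a key < $target;
    # the first line starting at or after $hi (if any) has a key >= $target.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);

        # Land on the first line boundary at or after $mid.
        if ($mid > $lo) {
            seek($fh, $mid - 1, SEEK_SET) or die "seek: $!";
            my $partial = readline($fh);    # discard the rest of that line
        }
        else {
            seek($fh, $lo, SEEK_SET) or die "seek: $!";
        }
        my $pos  = tell($fh);
        my $line = $pos < $hi ? readline($fh) : undef;

        if (!defined $line) {
            $hi = $mid;              # no line starts in [$mid, $hi)
        }
        elsif ($key_of->($line) < $target) {
            $lo = tell($fh);         # the answer lies after this line
        }
        else {
            $hi = $pos;              # this line is a candidate
        }
    }

    seek($fh, $lo, SEEK_SET) or die "seek: $!";
    return $lo;
}

# Usage on an in-memory "logfile" whose lines start with a number.
my $log = "10 a\n20 b\n30 c\n40 d\n";
open my $fh, "<", \$log or die "open: $!";
my $first_number = sub { my ($n) = $_[0] =~ /^(\d+)/; $n };
seek_to_key($fh, 25, $first_number);
print scalar readline($fh);          # prints the line starting with 30
```

Even in this small form there are several off-by-one traps around line boundaries, which is a good argument for reusing tested code.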

Since you are dealing with Perl here, it would be natural to first look on CPAN for a module which somebody else has already written to do just this.

And of course somebody has. Enter File::SortedSeek by Dr. James Freeman. The module interface is a bit weird, so it pays off to read the documentation carefully.

At any rate, here is a complete program that handles the task, assuming that the timestamp (in pretty much any format) is at the beginning of each line of the logfile:


#! /usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use File::SortedSeek;
use Time::ParseDate;

my ($from, $to);
usage() unless GetOptions("from=s" => \$from, "to=s" => \$to);
usage() unless @ARGV == 1;
$from = parsedate($from) if $from;
$to   = parsedate($to)   if $to;

my $filename = shift;

open L, "<", $filename or die "unable to open $filename: $!\n";
File::SortedSeek::set_silent(1);

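# Look up the end offset first; the second search then leaves the
# handle positioned at the start of the interval.
my $end = $to   ? File::SortedSeek::numeric(*L, $to,   \&time2sec) : 0;
my $beg = $from ? File::SortedSeek::numeric(*L, $from, \&time2sec) : 0;
$end ||= 0;  $beg ||= 0;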
while (<L>) {
    print;
    $beg += length($_);
    last if $end && $beg > $end;
}

sub usage
{
    print STDERR <<EOF;
usage:
\t$0 --from date-time [--to date-time] filename
\t$0 -f date-time [-t date-time] filename
EOF
    exit 1;
}

sub time2sec
{
    my $line  = shift;
    return undef unless defined $line;
    my $r = parsedate($line, FUZZY => 1);
    $r;
}

Nifty, eh?

Unknown CPAN II: Time::ParseDate

There are a number of very good, but not very well known, Perl modules on CPAN.

Sometimes I’ll be writing short posts about such modules which I use and appreciate.

When you are dealing with date and time in Perl, inevitably you will reach a point when you need to do more than is immediately available through Perl builtins and the POSIX module.

Then you try to find a module for what you want on CPAN, and you drown in literally hundreds of modules dealing with dates and times.

Luckily, there is a clear winner in this “modules war” - everybody (or at least everybody sane) recommends using the DateTime module, and for the things that it cannot do, various other modules from the same namespace.

So life is bright for a Perl programmer on the date/time front, until you need to parse a date represented in one of a multitude of “human-readable” formats, and you don’t know in advance which one it is going to be.

DateTime itself cleverly refuses to deal with this task at all, and instead recommends using one of the DateTime::Format:: modules.

You will be relieved to know that you can easily and quickly create parsers for your own date formats - that is, if you are able to remember that you should use the module aptly named DateTime::Format::Builder::Parser::Regex.

The documentation for DateTime::Format::Bork is also very enlightening.

Aaaaanyway.

I prefer to go against the flow here, and use a module somewhat unfortunately named Time::ParseDate. I mean, it could just as easily be Date::ParseTime or something, right? Worse, for years I had trouble remembering which distribution this module comes from (it, very obviously for everyone but me, can be found in the Time-modules distribution).

At any rate, if we forget for a second about the funny names, this module is truly a wonder:

$ perl -MTime::ParseDate -le 'print parsedate("Sat Feb 14 00:31:30 2009")'
1234567890
$ perl -MTime::ParseDate -le 'print parsedate("2 days ago")'
1236443283
$ perl -MTime::ParseDate -le 'print parsedate("18:30")'
1236619800

It exports a single function, which takes a single parameter (unless you want to specify some options which are rarely needed in practice), and you get your epoch seconds back in return. Very simple, very elegant, gets the job done. I wish there were more “straight to the point” modules like this one.
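The first result above can be cross-checked with core modules only. The sketch below assumes the example was run in a UTC+1 time zone (an assumption, the post does not say so). parsedate() reads a string without a zone as local time, so the same input yields a different number elsewhere.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# "Sat Feb 14 00:31:30 2009" read as if it were UTC.
# Arguments: sec, min, hour, mday, mon (0-based), year.
my $as_utc = timegm(30, 31, 0, 14, 1, 2009);

# Shift by the assumed local offset of +1 hour.
my $assumed_offset = 3600;
my $epoch = $as_utc - $assumed_offset;

print "$epoch\n";    # prints 1234567890
```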

Unknown CPAN I: Sys::RunAlone

There are a number of very good, but not very well known, Perl modules on CPAN.

Sometimes I’ll be writing short posts about such modules which I use and appreciate.

There is a common task of executing a script from cron periodically, subject to the following conditions:

  • a script can occasionally run a relatively long time (longer than the interval at which it is launched by cron);
  • such long runs do not happen often;
  • running two or more instances of the script at the same time will lead to all sorts of strange things happening and must be avoided;
  • skipping a single run is no big deal.

A usual method to prevent strange things from happening is to use a lockfile, like in this example:

use strict;
use warnings;

use Fcntl qw(:DEFAULT :flock);

sysopen(L, "/var/run/myprocess.lock", O_WRONLY | O_CREAT)
    or die "cannot open lockfile: $!";
flock(L, LOCK_EX | LOCK_NB)
    or die "cannot obtain lock: $!";

# ... do work here ...

unlink "/var/run/myprocess.lock"; # optional
close(L);

(You might want to silently exit when flock fails in order to not spam yourself with useless cron mails).
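A minimal sketch of that quieter variant follows. The helper name try_lock and the lockfile path are placeholders, and the sketch treats any flock failure as "somebody else is running", which is a simplification.

```perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock);

# Try to take an exclusive, non-blocking lock on $path.
# Returns the open handle on success (keep it open while working),
# or undef if the lock could not be obtained.
sub try_lock {
    my ($path) = @_;
    sysopen(my $fh, $path, O_WRONLY | O_CREAT)
        or die "cannot open lockfile $path: $!";
    return $fh if flock($fh, LOCK_EX | LOCK_NB);
    close($fh);
    return;
}

# Placeholder path; adjust to wherever your lockfiles live.
my $lockfile = "/tmp/myprocess.lock";
my $lock = try_lock($lockfile) or exit 0;    # busy: leave quietly

# ... do work here ...

close($lock);
```

Failing to open the lockfile still dies loudly, since that usually points to a real problem such as wrong permissions.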

Nothing fancy, really. But over time it becomes boring to write all this housekeeping code in every little cronjob, especially since in some cases the “do work here” part can be comparable in size to the locking part.

The solution, of course, is to use the CPAN magic and to find a module which sweeps all this complexity under the rug, leaving us with a clean and simple interface, so that we can concentrate on getting the job done.

As usual with CPAN, there is not one, but several modules which were written to perform this task. Most of them are rather powerful, which is unfortunate, since we want simplicity of use above all else. The bells and whistles provided by those modules might be needed in certain situations, but for the purpose described above they just get in the way.

There is, however, a wonderfully simple (and a rather clever) module by Elizabeth Mattijsen, Sys::RunAlone. To get the same functionality as the code above, all you need to do is this:

use strict;
use warnings;
use Sys::RunAlone;

# ... do work here ...

__END__

That’s it. Nothing else to write.

There are only two minor things to remember about this module, if you want to avoid problems.

First, it uses the script’s DATA handle to do the locking (that is, it actually uses the script’s file itself). So if you have several symlinks pointing to the same script, you cannot run them at the same time for it is still one physical file and one DATA handle.

Second, and for the same reason, if you modify the script while it is running and then launch it again, it will fail to detect that another instance is already running, since the DATA handle will be different.

Just keep this in mind when you use it.

"Idiots can vote too"

My blood pressure was quickly raised by this: http://cpanratings.perl.org/user/dandv.

The gist of his so called reviews: “I did not use this module, but Catalyst has switched from it to something else, hence I rate it with 1 star out of 5. Avoid. Use Moose”.

WTF?? Whatever has happened to TIMTOWTDI? Who is this guy?

1 The title of this post shamelessly taken from kaare’s remark on #cph.pm

Stupid code examples in documentation

Dear maintainer of Spreadsheet::ParseExcel!

Please remove the

$sheet->{MaxCol} ||= $sheet->{MinCol};

statement from the loop over spreadsheet rows in the example at the top of the module documentation. People just cut and paste this into their code, which is pointless.

This madness goes far - I even saw this code in a presentation at the local Perl Mongers group technical meeting.

Take pity on poor wasted electrons, K THX.

Perl, maps, and geocaching

Inspired by Edmund von der Burg’s talk at the recent Nordic Perl Workshop in Stockholm, Henrik, Lars, and I started to play with Open Street Maps.

Some work-related goodness will probably come out of it. Meanwhile, we played with mapping the caches we’ve found in Stockholm during the workshop.

From this map I can deduce three things:

  • we need a life;
  • the Open Street Maps community in Stockholm has not reached a critical mass yet;
  • central Stockholm needs more regular geocaches.

And now - back to the scheduled silence.

Some time ago several people (most notably skv@) ranted about including a list of changes or a link to such list in the commit message for a port update.

I thought it was a great idea and started including a link to the CPAN distribution’s Changes file in my commits.

What I did not like was that those links looked like this:

http://search.cpan.org/src/JESSE/Template-Declare-0.27/Changes

FreeBSD’s commit messages are preserved in our repository and mail archives forever, for a suitable definition of “forever”. On the other hand, CPAN authors are encouraged to clean up old and obsolete versions promptly.

Thus there is a discrepancy between the expected lifetime of the link in the commit message and that of the content it points to.

While older CPAN distributions can still be found on BackPAN, it only provides links to tarballs and not individual files like Changes.

Luckily, it turns out that version-less links like

http://search.cpan.org/dist/Template-Declare/Changes

work just fine, redirecting to the most recent version of the file. This is acceptable, since Changes is expected to be a prepend-only file, so the information the commit message was trying to link to can (almost) always be found there.
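If you have old versioned links lying around, the rewrite is mechanical. Here is a small sketch; the helper name versionless_link is made up, and the pattern is a heuristic that assumes the version is the last hyphen-separated component and starts with a digit (optionally preceded by "v"). URLs it does not recognise are returned unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a versioned search.cpan.org source link into
# the version-less /dist/ form. Unrecognised URLs are returned unchanged.
sub versionless_link {
    my ($url) = @_;
    $url =~ s{^(http://search\.cpan\.org)/src/[^/]+/([^/]+)-v?\d[^/-]*/(.+)$}
             {$1/dist/$2/$3};
    return $url;
}

print versionless_link(
    "http://search.cpan.org/src/JESSE/Template-Declare-0.27/Changes"
), "\n";
# prints http://search.cpan.org/dist/Template-Declare/Changes
```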

A fellow former Bloglines user has asked me whether I had found a way to back up Google Reader subscriptions into an OPML file from cron, as we used to do with our Bloglines accounts.

A quick search turned up this, which, from the look of it, works only if every feed is explicitly marked with a tag that is set up as public.

This by itself is rather cumbersome, and you have to remember to do that for every new feed you subscribe to, otherwise you’ll defeat the purpose of making periodic backups in the first place.

Luckily, there is a better solution. There is a nice little module on CPAN, WebService::Google::Reader by gray, which uses an unofficial Google Reader API to do various nifty things with your Google Reader subscription, including OPML export.

This means that after installing the module you can simply put the following command into your crontab (only command itself is shown, see crontab(5) manual page to find out what else you will want to put in there):

env GOOGLE_USERNAME=your-username-typically@gmail.com \
  GOOGLE_PASSWORD=your-user-password \
  perl -MWebService::Google::Reader -e \
  'print WebService::Google::Reader->new(
     username => $ENV{GOOGLE_USERNAME},
     password => $ENV{GOOGLE_PASSWORD})->opml' \
  > /where/to/put/greader.opml

You will have to join the above into one long line to satisfy crontab syntax, and of course remember to use a real username, password, and the path to the resulting OPML file.

Unfortunately, the most recent version of the module (which is 0.03 at the time of this writing) has a minor bug which prevents the opml() method from working correctly. So you will need to do a little patching.

Before installing the module, edit the source file lib/WebService/Google/Reader/Constants.pm, look for a string subscribtions, and fix the spelling (finding correct spelling is left as an exercise for the reader). Then proceed installing the module as usual.

Hopefully, this step won’t be necessary in a couple of days’ time when a new version of the module is released.

If you are a FreeBSD user like myself, you may choose instead to fetch a skeleton of the port of the module. Unpack it in /usr/ports/www/ and install it as you would any other port.

I intend to add the port to the ports collection as soon as our current ports freeze is over.

Enjoy!
