Today at work I needed to locate and extract, automatically, some information from a website.

There was no direct URL to the information I needed: some fields had to be filled in and some forms had to be POSTed.

Normally I would use WWW::Mechanize for such a task, but in this particular instance the situation was made somewhat less manageable by the fact that the site in question was implemented with ASP.NET.

The problem with this is that every link has an associated JavaScript event handler which does some housekeeping, assigns things to funnily named hidden input fields like __EVENTTARGET and __EVENTARGUMENT and then POSTs a form.

My first thought was to try to find a CPAN module which handles those complications. Not surprisingly, there is one, aptly named HTML::TreeBuilderX::ASP_NET.

According to its documentation, the module works in combination with the standard LWP::UserAgent and HTML::TreeBuilder, and converts ASP.NET JavaScript posting redirects into an HTTP::Request object which can be fed to LWP::UserAgent's request() method. Just what the doctor ordered.
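
Judging from the documentation, the intended usage is roughly the following. This is only a sketch: the URL is made up, and the constructor arguments are my reading of the docs, so double-check them against the module's own synopsis.

use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::TreeBuilderX::ASP_NET;

my $ua   = LWP::UserAgent->new(cookie_jar => {});
my $tree = HTML::TreeBuilder->new_from_content(
    $ua->get("http://example.com/SomePage.aspx")->decoded_content);

# find a link whose href is a JavaScript postback
my $link = $tree->look_down(_tag => "a",
    sub { ($_[0]->attr("href") || "") =~ /__doPostBack/ });

# let the module turn it into a plain HTTP::Request
# (the constructor arguments here are an assumption)
my $req  = HTML::TreeBuilderX::ASP_NET->new({ element => $link })->httpRequest;
my $resp = $ua->request($req);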

However, it turned out that my joy was a bit premature:

  • it requires Perl 5.10, which we do not yet have on our production systems;
  • the documentation is incomplete and at times inaccurate: it insists on calling its httpRequest() method httpResponse();
  • it fails its own tests, not only on the two machines I tried it on, but also on a lot of other systems, according to CPAN Testers.

After a bit of pondering I decided that spending time trying to fix the HTML::TreeBuilderX::ASP_NET module would be counterproductive: I needed working code soon.

So what to do?

One thing we should keep in mind is that those JavaScript postbacks do not do anything fancy. The hidden fields that are filled in depend on what was clicked on the page, nothing else. After they are filled, a normal POST occurs.

So if we know what to POST, we could just use WWW::Mechanize and get the job done easily and quickly.
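
To illustrate the idea, here is a minimal sketch of replaying such a postback with WWW::Mechanize. The URL, the form number, and the field values are hypothetical placeholders that you would take from a captured POST:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get("http://example.com/SomePage.aspx");

# select the (usually single) ASP.NET form and fill in the hidden
# fields that the JavaScript handler would have set
$mech->form_number(1);
$mech->field(__EVENTTARGET   => 'ctl00$SomeControl$lnkNext');
$mech->field(__EVENTARGUMENT => '');

# a perfectly ordinary POST, just like the browser would do
$mech->submit();
print $mech->content();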

So the solution naturally splits into two parts - finding out what fields to set, and automating the process.

The first part is to launch a browser, do the clicking and typing by hand, and capture what gets POSTed at each step. This capturing can be done in a variety of ways:

  • tcpdump/wireshark - listen to 'em on the wire!
  • having a proxy which outputs the POSTed parameters;
  • using a browser extension that shows POSTed parameters.

I chose the second option, since I already had a script similar to what I needed, and since it makes it easy to filter out any parameters I did not want to see, like __VIEWSTATE, which can easily be several kilobytes long.

Enter spyproxy.pl:

#! /usr/bin/perl
use strict;
use warnings;
use HTTP::Proxy;
use CGI;

# Listen on localhost (HTTP::Proxy's default port is 8080) and log
# filter activity only.
my $proxy = HTTP::Proxy->new(host => "localhost");
$proxy->logmask(32); # 32 - FILTERS
$proxy->push_filter(
    request => Spy::BodyFilter->new(),
);
$proxy->start;

# A request body filter which prints the method and URI of every
# request, and for POSTs also dumps the submitted parameters,
# skipping the bulky __VIEWSTATE.
package Spy::BodyFilter;
use base qw(HTTP::Proxy::BodyFilter);

sub will_modify { 0 }  # we only peek at the request, never change it

sub filter
{
    my ($me, undef, $req) = @_;
    print $req->method, " ", $req->uri, "\n";
    return unless $req->method eq "POST";
    my $body = $req->content;
    my $q = CGI->new($body);  # parse the urlencoded body into parameters
    for my $p ($q->param) {
        next if $p eq "__VIEWSTATE";
        print "$p\n\t", $q->param($p), "\n";
    }
}

Launch it locally in a terminal, set your browser's proxy to localhost:8080 (HTTP::Proxy's default port), and watch the output in the terminal.

The second part of the puzzle is to use the wonderful WWW::Mechanize::Shell. It provides an interactive shell in which we can issue GET requests, see the content of the responses, view links, forms, and form fields with their values, follow links, set field values, click buttons, and submit forms. Best of all, once we have what we are after, we can issue the script command and get a piece of Perl code that performs all the steps we have just done.
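
To give a flavour of it, a session might look something like this. The first line is one way to start the shell (as shown in its documentation); the rest are typed at the shell prompt, and the URL, field name, and value are made-up placeholders:

perl -MWWW::Mechanize::Shell -e shell

get http://example.com/SomePage.aspx
forms
value __EVENTTARGET ctl00$SomeControl$lnkNext
submit
content
script scrape-example.pl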

So the final solution looks like this:

  1. Load the start page in your browser (through the spyproxy).
  2. Load the same page in WWW::Mechanize::Shell.
  3. In the browser, fill in any fields that need filling, and click where you want.
  4. Observe the spyproxy output and note any fields that need setting. In a typical ASP.NET application you will want to ignore the vast majority of the fields at any given moment. Don't worry, humans are good at this sort of pattern recognition. :-) Pay special attention to the __EVENTTARGET and __EVENTARGUMENT fields.
  5. Set the same fields to the same values in the shell (use value fieldname fieldvalue).
  6. If __EVENTTARGET was set, type submit in the shell; otherwise, find the name of the button that was pressed (see step 4) and type click buttonname in the shell.
  7. Examine the content of the response (content in the shell) to make sure that what you've got in the shell makes sense.
  8. If more clicking and entering is to be done, go to step 3.
  9. Type script script-name.pl in the shell.
  10. Go edit script-name.pl: remove any prints you do not need, and replace the constants you entered into the fields with variables where needed (see the snippet after this list).
  11. Your custom scraping script is ready to use.
  12. ...
  13. Profit!
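
For step 10, the edit is often as small as replacing a value that was typed during the interactive session with a variable. The generated code may not look exactly like this, and the field name is a hypothetical example, but the idea is:

# before: the constant captured from the interactive session
$agent->field('ctl00$txtAccountNumber' => '12345');

# after: take the value from the command line instead
my $account = shift @ARGV;
$agent->field('ctl00$txtAccountNumber' => $account);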

I hope this trick will be of use to somebody. Enjoy!

comment 1

I'm the author of HTML::TreeBuilderX::ASP_NET, and it doesn't fail any tests... It broke a few times because of instabilities in Moose (you can read about it in the changelogs): http://www.cpantesters.org/distro/H/HTML-TreeBuilderX-ASP_NET.html

The doc bug you mention is fixed in github (afaik). I can push another one up to CPAN, but there is some new functionality there too that someone else has been testing. Please submit your failing tests to RT and I'll fix them!

As for perl 5.10: yes, it requires 5.10, which has been out for over two years. I would suggest you try HTML::TreeBuilderX::ASP_NET::Roles::htmlElement, which is dirt simple.

Grab me on irc or email me if you need any help.

Comment by Evan Carroll Tue Dec 29 19:18:11 2009
comment 2

Evan,

Re failing tests: I'm afraid you had not released 0.09 at the time I was trying to use your module. According to CPAN, 0.09 was uploaded on the 26th of August, the same day I wrote this entry, and my actual experience predates that by at least hours, possibly days. Version 0.08 was indeed failing its tests (although, as you mention, Moose was responsible).

I will certainly try your module again the next time I have some ASP scraping to do.

Cheers, \Anton.

Comment by tobez Tue Dec 29 19:58:45 2009
comment 3

Cool, no problem. I just saw your entry on programming.reddit.com today and didn't realize the posting was old. This is my bad; there was definitely a problem with it failing tests around that time, but I assure you it was upstream. ;) Blame the Moose guys.

Comment by Evan Carroll Wed Dec 30 01:18:20 2009

comment 4

Very interesting solution. I am trying to automate some tasks at work: authorization on Jira, Confluence, and other network tasks.

Comment by Николай Tue Oct 11 06:08:57 2011

comment 5

Failed to install it with perl -MCPAN -e shell; it complains about a failed make test. I guess it wasn't fixed after all?

Comment by Ivan Fri Oct 14 15:54:30 2011

comment 6

Failed to install it with perl -MCPAN -e shell; it complains about a failed make test. I guess it wasn't fixed after all?

Ivan, no idea; you should probably report this to the author, Evan Carroll.

Comment by tobez Fri Oct 14 16:17:51 2011