I had to do some screen scraping yesterday. I have previously used Hpricot for this kind of thing (or even just plain cURL or similar), but this time the page required me to be logged in, and not through any “standard” HTTP authentication that cURL could deal with, but through a custom, HTML-based CMS login form. So I needed something more powerful: something that could pass as a “proper” browser and not just a simple crawler.
I had heard about quite a few good experiences with Mechanize, so I decided to give it a try, and it turned out to vastly exceed my expectations! 🙂
For some reason my mind was in a ruby mood yesterday, so I decided to go for the ruby port. Everything was flowing nicely between ruby and me. I even dared to prototype what I wanted to do in a terminal, using irb, before writing the final script. My ruby muscle was getting toned again, so to speak.
I had some false starts, though. I first invoked just ‘ruby’ in the terminal, only to be greeted by a silent, waiting terminal: nothing happened because, I guess, ruby was waiting for me to feed it something to run on standard input. Then I remembered that the interactive ruby executable is called irb. A Control+C and four other keystrokes later, I was in immediate-mode ruby.
Another gotcha: not having to
when running in irb, but having to when writing a script. Thankfully, I remembered that one and was able to fix it quickly.
But I’m digressing. Back to Mechanize: it simulates a real browser interacting with a website. You can even fake the user agent. The most basic screen scrapers simply download a page, do something with it, then download another one and do something else, without keeping any continuity between downloads and connections. Mechanize, by contrast, is stateful: it keeps state, such as cookies and history, between ‘visited’ pages. For all practical purposes, the visited website sees a normal browser at the other end of the line. Unless you go crazy and begin hammering a server with tons of requests, crawling websites this way might be almost invisible to servers.
Which is exactly what I wanted!
The syntax is quite nice and intuitive. Borrowing shamelessly from the manual:
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')

# List links on the page
page.links.each do |link|
  puts link.text, link.href
end

# Click on the first link with text 'News'
page = agent.page.link_with(:text => 'News').click

# Do a Google search, using the search form
google_form = page.form('f')
google_form.q = 'ruby mechanize'
page = agent.submit(google_form, google_form.buttons.first)
Something that made me even happier is that Mechanize uses Nokogiri internally for parsing HTML, which means we can do the same style of nice DOM tree traversal that I used to do with Hpricot, only even better! (Nokogiri is the successor to Hpricot.)
I didn’t need that last feature for this particular case, but it left me wondering what else I could use Mechanize for, just to get to use it! I will have a look at my TODO list and see where I can use my newly acquired Mechanize skills!