Posts Tagged ‘hpricot’

20081024 How to install hpricot in Ubuntu 8.04

This could be considered a fresh installation, speaking in ruby terms. I just had ruby installed, no ruby gems, nor ruby dev nor anything else ruby. So this should be enough for installing hpricot as well as ruby gems (which are required for installing hpricot).

As you can see, I didn’t download any source file; instead I was happy with using apt-get and the package versions from the ubuntu repositories, although they are relatively old (for example, rubygems is more than a year old). If I find any problem and need to update to newer versions I’ll report that here ;-)

sudo apt-get install rubygems
sudo rm /var/lib/gems/1.8/source_cache
sudo gem update
sudo apt-get install ruby1.8-dev
sudo gem install hpricot

It’s a pity they don’t have a metapackage for ruby’s development files (the ruby1.8-dev package), the same way there’s a ruby metapackage which depends on the ruby1.8 package. With such a metapackage, whenever ruby is updated the development files would be updated as well, without the user having to worry about the version number.

Even more, I instinctively tried a naive sudo apt-get install rubydev and was greeted with a sad “Couldn’t find package rubydev”. It somehow proves that a metapackage called rubydev would be quite useful… at least for instinctive users.

Enjoy your screen scraping!

20080325 Parsing a del.icio.us export with Hpricot

The trickiest part is to detect if a bookmark has a corresponding description. The export is in the same format that Netscape used for its bookmarks export, which means it is a simple html file with a definition list (dl) and a series of definition terms (dt). A term (=bookmarks) may have a description (dd).

But how do you detect if there’s a description? It seems the answer was rather simple: use term.next and if the next element’s name is dd, we’re lucky and have a description. The only problem was that I didn’t know how to access the name of an element, until I just thought: what if I simply use name? and guess what… it worked! So term.next.name was exactly what I looked for :-)

require 'rubygems'
require 'hpricot'

doc = open("delicious.html") {|f| Hpricot(f) }

bookmarks = []

(doc/"dl/dt").each do |term|
        link = (term/"a")
       
        if term.next and term.next.name == 'dd'
                desc = term.next.inner_text
        else
                desc = nil
        end
       
        if link.attr('tags')
                tags = link.attr('tags').split(",")
        else
                tags = nil
        end
       
        bookmarks << {
                :address                =>      link.attr('href'),
                :created_at     =>      link.attr('last_visit'),
                :tags                   =>      tags,
                :description    =>      desc,
                :title                  =>      link.inner_text
        }
       
end

Source at supersnippets.

I also extended this a bit to save the results into a database, using ActiveRecord, but since each db schema is a different world, I didn’t post that version here. If anybody thinks it might be useful just let me know.

Also, this code is not very rubyesque yet; suggestions to improve it will be really appreciated. I’m especially thinking about the if … else parts, I’m pretty sure there’s a way to shorten those lines :-)
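On the if … else parts, here is one possible way to shorten them. It is only a minimal sketch: Node is a made-up stand-in for an Hpricot element (it offers just name and inner_text), and description_for and tags_for are hypothetical helper names, so the idea can be tried without Hpricot installed.

```ruby
# Node is a hypothetical stand-in for an Hpricot element (assumption for this sketch)
Node = Struct.new(:name, :inner_text)

# An `if` used as a modifier evaluates to nil when the condition is false,
# so the explicit `else desc = nil` branch is not needed.
def description_for(next_node)
  next_node.inner_text if next_node && next_node.name == 'dd'
end

# `&&` short-circuits: nil stays nil, a string gets split.
def tags_for(raw_tags)
  raw_tags && raw_tags.split(",")
end

with_desc    = description_for(Node.new('dd', 'A note'))  # => "A note"
without_desc = description_for(Node.new('dt', 'Other'))   # => nil
no_sibling   = description_for(nil)                       # => nil
some_tags    = tags_for("ruby,hpricot")                   # => ["ruby", "hpricot"]
no_tags      = tags_for(nil)                              # => nil
```

In the loop above, the same two expressions could be written inline, with term.next and link.attr('tags') in place of the arguments.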

20071005 Removing elements with Hpricot

Something like a month ago, a guy asked me how to remove elements with Hpricot. I told him I would look into it but it’s been a month already! So I hope I can compensate for the delay with this minitutorial on removing stuff with Hpricot! :-)

First I created a simple test page. It’s got some html elements, some have id’s, some contain certain text nodes. It looks like this:

<p>This is a paragraph without attributes</p>
        <p id="bad_attribute">This is a paragraph with one attribute: id=bad_attribute</p>
        <ul>
                <li>Element 1</li>
                <li>This will be removed because the text doesn't begin with an E</li>
        </ul>
        <ul id="second_list" style="border:1px solid red;">
                <li>Element 1 in the list with id=second_list</li>
                <li>element 2</li>
        </ul>

The question was how to remove certain individual elements given certain conditions – more specifically, when the element attributes matched a condition. I don’t see why he had problems removing stuff with the remove method, since that’s what I have used. Since search returns a collection of elements, you just need to get a collection which contains only the element you want to remove, and then apply remove to that collection.

Here are three examples:

Removing the paragraph with id = bad_attribute

We find the element using a CSS selector, where the hash means ‘id’.

doc.search("p#bad_attribute").remove

Removing all the unordered lists (ul’s) which have a style attribute

Again, using CSS selectors:

doc.search("ul[@style]").remove

There’s more info about CSS selectors in the Hpricot CSS search documentation. One can get very creative with this, and it allows for filtering almost everything!

Removing elements whose contents match certain conditions

When CSS selectors are not enough, we can always take advantage of ruby!

For example, if you want to remove list items (li’s) whose text doesn’t begin with E, you could do it with this:

doc.search("li").collect!{|node|
        node if not /^E/.match(node.inner_text)
}.compact.remove

which is the same as saying:

  • Look for every list item in the document
  • Take the results of that search (which is an Array of Hpricot Elements) and apply the collect! function to them
  • collect! executes the code in the block for each element and stores the return value in an array
  • But as it can return nils (when the inner_text does begin with ‘E’ and hence matches our little regular expression, so the block returns nothing), we remove nil values from the array with compact, so that we don’t get errors when removing.
  • And finally, remove the elements which are in the resulting array, with the classical Hpricot remove

Note how I used collect! instead of just collect, so that the changes are applied over the search results, and we don’t get a new array instead.

You should try using collect instead of collect!, and removing compact from the chain, to see what happens.
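If you don’t have Hpricot at hand, the difference can be seen with plain strings too. This is only a sketch: the strings stand in for the inner_text of each list item, and the reject line at the end is an alternative I’m assuming reads more idiomatically, not something taken from the original script.

```ruby
# Plain strings standing in for the inner_text of each <li> (sample data)
texts = [
  "Element 1",
  "This will be removed because the text doesn't begin with an E",
  "element 2"
]

# Same idea as the block above: return the item unless it starts with "E"
picked = texts.collect { |t| t unless /^E/.match(t) }
# picked => [nil, "This will be removed because ...", "element 2"]

# compact drops the nil left behind by "Element 1"
cleaned = picked.compact

# collect returned a new array and left texts alone;
# collect! overwrites the array it is called on
copy = texts.dup
copy.collect! { |t| t unless /^E/.match(t) }

# reject expresses "everything that does not start with E" in one step
to_remove = texts.reject { |t| t =~ /^E/ }
```

Here cleaned and to_remove hold the same two strings, which is why compact is needed after collect but not after reject.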

Final result

If one applies all these evil removals to the original code, the final result is this:

<p>This is a paragraph without attributes</p>
               
        <ul>
                <li>Element 1</li>
               
        </ul>

Pretty empty, isn’t it?!

Download these examples

I’ve uploaded the hpricot_remove_elements.rb and test.html together in a zip file: hpricot_remove_elements.zip. For running it, just unpack, and type ruby hpricot_remove_elements.rb

Or open with textmate and press Option+R ;-)

20070615 Extracting data with Hpricot

For those (few) of you who haven’t heard about it, Hpricot is a nice library for parsing HTML in ruby, created by the even nicer _why the lucky stiff, author of Why’s (Poignant) Guide to Ruby, Camping and other ruby gems (may you excuse the pun? it was impossible to avoid it).

Ever since I saw a demonstration by Rob McKinnon at a certain LRUG meeting, I had been wanting to try Hpricot, but I hadn’t found an application for it yet. No more! I found myself today wanting to extract data from a table in a web page and suddenly I thought: this is a job for Hpricot! More specifically, I wanted to extract these EXIF tags, and I simply couldn’t accept the mere thought of entering that data manually. It needed to be automated!

Getting it

Getting Hpricot is very easy:

sudo gem install hpricot

or, if you’re on windows, simply

gem install hpricot

(If you’re picky, you can try the more exotic ways of installing it described on its homepage.)

Understanding it is easy as well, especially if you have used jQuery before. It’s all about writing selectors to look for things, so it helps a lot if the HTML document is well marked up. Otherwise, you might end up writing lots of workarounds or extra code that could be avoided simply by having a class or id on the relevant elements.

Inspecting & traversing

So, once I got the library installed, I took a look at the page source code with Firebug. It is especially useful for this kind of job because it helps you visualize the hierarchy of elements in the page, including classes and id’s, so you don’t have to traverse the HTML tree manually to gather the data you need.

What I was looking for was the table which contained the relevant data. In this case we’re lucky: even though the table hasn’t got an id attribute which would make it uniquely identifiable in the whole document, it still has class="inner", which happens to be used only once in it, thus acting effectively as an element identifier.

Firebug in action!

Note how Firebug is showing the tree path for the selected table. If we didn’t have the class attribute, we would need a selector like "/html/body/blockquote/table/tbody/tr/td/table"; with it, something as simple as "table.inner" is enough.

Hands on Ruby

Ok, so this is where we write a few lines of code which do a lot ;-)

First come the usual series of requires:

require 'rubygems'
require 'hpricot'
require 'open-uri'

Rubygems is required in order to load hpricot, and open-uri is required in order to directly read data from a URI. open-uri comes with ruby, so we don’t need to install anything else.

Now we need to get the HTML file. It is as simple as

doc = Hpricot(open("http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html"))

but since I was doing lots of tests and didn’t want to overload that guy’s server, I simply saved the document as EXIF.html and loaded it with this instead:

doc = open("EXIF.html") { |f| Hpricot(f) }

At this point we have the HTML document in the doc variable, so what are we waiting for?
We initialize a rows variable for holding the data that we’ll extract:

rows = []

And now comes the real fun!

(doc/"table.inner//tr").each do |row|
    cells = []
    (row/"td").each do |cell|

        if (cell/"span.s").length > 0
            # The cell has a note: turn each "key = value" piece into a Hash
            values = (cell/"span.s").inner_html.split('<br />').collect{ |str|
                pair = str.strip.split('=').collect{|val| val.strip}
                Hash[pair[0], pair[1]]
            }

            if values.length == 1
                # Only one piece, so not a list of values: keep the plain cell text
                cells << cell.inner_text
            else
                cells << values
            end

        else
            cells << cell.inner_text
        end
    end
    rows << cells

end

Ok, not that fast. I’ll elaborate a little more on the juicy bits.

(doc/"table.inner//tr").each do |row|

This is the key for reaching the main data. It’s like saying I’m looking in doc for all the rows (the tr’s) which are contained in a table whose class equals ‘inner’. When we use a / it means we want an immediate child, while // means a descendant at any depth below the element. As I said before, it’s all about selecting and traversing the tree.

With that line of code, each tr is handed to the block in the row variable. We can continue extracting data from within row, and that’s exactly what we do with

(row/"td").each do |cell|

That one provides us with all the td elements immediately below the current row.

When we reach the td elements, all that is left is to extract the data for each cell and push it into the cells array, which will be pushed into the rows array. But we don’t just copy the cell data as it is; some cells contain notes, and some of those notes contain lists of values. I think we can all agree that those lists of values are commonly called Hashes, and they undoubtedly deserve a special treatment!

if (cell/"span.s").length > 0

So that’s why I’m checking for the existence of a span with class == s inside each cell. If we find one, there’s a note in this row, and there’s probably a hash with values. I would say this is the most fun part of all:

values = (cell/"span.s").inner_html.split('<br />').collect{ |str|
  pair = str.strip.split('=').collect{|val| val.strip}
  Hash[pair[0], pair[1]]
}

I’m making use of the fact that each invoked function returns another object, so that I can chain them consecutively instead of doing a series of assignments. And it reads like this: take the html inside the span with class s, split it where you find a br, and for each of those parts remove the surrounding whitespace and split it again where you find a =, so we get a key-value pair; remove the whitespace from both halves of the pair as well and put them in a new Hash.
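To watch the chain work on its own, here is a small sketch that runs without Hpricot. The note_html string is made-up sample text standing in for what (cell/"span.s").inner_html might return; the real page contents may differ.

```ruby
# Hypothetical note contents, standing in for (cell/"span.s").inner_html
note_html = "1 = Horizontal (normal)<br />2 = Mirror horizontal<br /> 3 = Rotate 180 "

values = note_html.split('<br />').collect { |str|
  # "1 = Horizontal (normal)" becomes ["1", "Horizontal (normal)"]
  pair = str.strip.split('=').collect { |val| val.strip }
  # Hash[key, value] builds a one-entry Hash
  Hash[pair[0], pair[1]]
}

# values => [{"1"=>"Horizontal (normal)"}, {"2"=>"Mirror horizontal"}, {"3"=>"Rotate 180"}]
```

Note that the result is an array of one-entry Hashes rather than a single Hash, which is exactly what the loop stores in the cell.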

At the end we finish with an array of rows and cells, where certain cells occasionally contain a Hash with the constants used by the row EXIF tag.

It’s also interesting to note that the first row is unusable, because it corresponds to the th elements, so we’ll simply do a

rows.shift

and it’s gone. And to top it all, we could output the rows array to a yaml file, so that we do not need to run this each time we need the list of EXIF tags.

Arrays in ruby have a lovely method called to_yaml which dutifully generates a version of the array in yaml syntax. And it’s very easy to output that to a file:

File.open('hexif.yaml', 'w') { |f|
  f << rows.to_yaml
}
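A quick way to convince yourself that nothing gets lost is to write the file and read it back. This is a sketch under a couple of assumptions: the rows below are made-up sample data in the same shape as the extracted ones, the file goes to a temporary directory, and yaml is required explicitly (in my script it was already loaded, but on other setups it may not be).

```ruby
require 'yaml'
require 'tmpdir'

# Made-up sample rows, in the same shape as the extracted ones (illustrative only)
rows = [
  ["0x0112", "Orientation", [{"1" => "Horizontal (normal)"}, {"3" => "Rotate 180"}]],
  ["0x0132", "ModifyDate", "string"]
]

# Write the yaml version of the array to a file in a temporary directory
path = File.join(Dir.mktmpdir, 'hexif.yaml')
File.open(path, 'w') { |f|
  f << rows.to_yaml
}

# Read it back: we get plain arrays, strings and hashes again
loaded = YAML.load_file(path)
```

Since the data only contains arrays, strings and hashes, loaded compares equal to rows.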

And you’re done! I hope you liked this small Hpricot tutorial/introduction… and if you have any suggestion or improvement please let me know!

Of course, you can get the complete source code here: hexif.rb. It is a ridiculous 61 lines, including some commented lines and white spaces. Come on get it and do something cool!

20070313 London Ruby Users Group brings you back to uni

After three failed attempts, I managed to go to yesterday’s lrug meeting. It was intended to be a kind of experimental collective code review, so people would contribute pieces of code and get them dissected and improved collectively. There was a special obsession with Hashes: most of the code submissions were improvements and/or workarounds for the Hash class. I understand it. Hashes are cool! The other topic was using continuations for (I believe) solving sudokus. Backtracking and Fibonacci were also mentioned in the session, and Rob McKinnon made one of his quick presentations, this time proposing a way of getting data from different sources into a generic shareable format (using upcoming as a specific example, and hpricot and hashes, of course!).

I must say it was pretty interesting, even if I got lost at some points (my ruby knowledge is still too poor). I especially got lost with the continuations stuff, which brought back uni memories of those times when I skipped some lessons and then went back to the classroom with lots of knowledge gaps and tried to follow the teacher (with no luck, usually). Hehe! But fortunately, this time the teacher was interesting and deserved to be listened to.

This reminded me as well of the beauty of programming and talking about pure concepts and abstractions. It had been ages since I felt that, so thanks to all who made it possible. I think we all need a good dose of abstraction from time to time. Keeps the brain working.

One of the books which was strongly and fervently recommended is Structure and Interpretation of Computer Programs, which I believe I read some years ago (again, at uni :-)). So you can see, ruby is not only about rails!