Parsing a del.icio.us export with Hpricot

The trickiest part is detecting whether a bookmark has a corresponding description. The export uses the same format that Netscape used for its bookmarks export, which means it is a simple HTML file with a definition list (dl) and a series of definition terms (dt). A term (that is, a bookmark) may be followed by a description (dd).
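To make that structure concrete, here is a small hand-written sample in the Netscape bookmark style. It is an illustration only, not taken from a real export: the URLs, timestamps, and titles are made up, the constant name SAMPLE_EXPORT is hypothetical, and only the attributes used later in this post are shown.

```ruby
# Hand-written sample in the Netscape bookmark style; not a real export.
SAMPLE_EXPORT = <<-HTML
<DL><p>
<DT><A HREF="http://example.com/" LAST_VISIT="1166993440" TAGS="ruby,parsing">Example site</A>
<DD>An optional description for the first bookmark
<DT><A HREF="http://example.org/" LAST_VISIT="1166993441" TAGS="misc">Bookmark without a description</A>
</DL><p>
HTML

# Count terms and descriptions with a plain regex scan (no HTML parser needed here).
terms        = SAMPLE_EXPORT.scan(/<DT>/i).size
descriptions = SAMPLE_EXPORT.scan(/<DD>/i).size
puts "#{terms} terms, #{descriptions} description(s)"
```

The second term has no dd after it, which is exactly the case the parser has to handle.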

But how do you detect whether there's a description? The answer turned out to be rather simple: call term.next, and if the next element's name is dd, we're lucky and have a description. The only problem was that I didn't know how to access the name of an element, until I thought: what if I simply use name? And guess what... it worked! So term.next.name was exactly what I was looking for :-)


require 'rubygems'
require 'hpricot'

# Parse the export file into an Hpricot document
doc = File.open("bookmarks.html") { |f| Hpricot(f) }

bookmarks = []

# Each dt inside the dl holds one bookmark link
(doc/"dl/dt").each do |term|
    link = (term/"a")

    # The description, if any, is the dd right after the term
    if term.next and term.next.name == 'dd'
        desc = term.next.inner_text
    else
        desc = nil
    end

    # Tags are stored as a comma-separated attribute on the link
    if link.attr('tags')
        tags = link.attr('tags').split(",")
    else
        tags = nil
    end

    bookmarks << {
        :address     => link.attr('href'),
        :created_at  => link.attr('last_visit'),
        :tags        => tags,
        :description => desc,
        :title       => link.inner_text
    }
end

Source at supersnippets.

I also extended this a bit to save the results into a database using ActiveRecord, but since every db schema is a world of its own, I didn't post that version here. If anybody thinks it might be useful, just let me know.

Also, this code is not very rubyesque yet, so suggestions for improving it would be really appreciated. I'm especially thinking about the if ... else parts; I'm pretty sure there's a way to shorten those lines :-)
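One possible way to shorten those branches, as a sketch rather than the definitive answer: a modifier if already evaluates to nil when its condition fails, and && short-circuits on nil, so neither else branch is really needed. To keep the example runnable without the Hpricot gem, Node is a hypothetical stand-in that only mimics the name, inner_text and next methods used above; description_for and tags_from are made-up helper names too.

```ruby
# Hypothetical stand-in for an Hpricot element, so the sketch runs without the gem.
Node = Struct.new(:name, :inner_text, :next)

# Text of the node following `term`, or nil unless that node is a dd.
def description_for(term)
  sibling = term.next
  sibling.inner_text if sibling && sibling.name == 'dd'
end

# Comma-separated tags as an array, or nil when the attribute is missing.
def tags_from(raw)
  raw && raw.split(',')
end

dd        = Node.new('dd', 'A handy site', nil)
with_desc = Node.new('dt', 'Example', dd)
without   = Node.new('dt', 'Other', nil)

p description_for(with_desc)
p description_for(without)
p tags_from('ruby,parsing')
p tags_from(nil)
```

Inside the loop this would become desc = description_for(term) and tags = tags_from(link.attr('tags')). On Ruby 2.3 or later, the tags line can also be written with the safe navigation operator: raw&.split(',').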