Parsing a del.icio.us export with Hpricot
The trickiest part is detecting whether a bookmark has a corresponding description. The export uses the same format that Netscape used for its bookmarks export, which means it is a simple HTML file containing a definition list (dl) with a series of definition terms (dt). Each term is a bookmark, and it may be followed by a description (dd).
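To make the structure concrete, here is a small hand-written sample in that style. It is only an illustration and not copied from a real export; the URLs, titles and timestamps are made up, and SAMPLE_EXPORT is a name chosen for this sketch. The attributes shown are just the ones the script reads. A plain regex count is enough to show that not every dt comes with a dd:

```ruby
# Illustrative sample only: hand-written to resemble the Netscape bookmark
# format, not taken from a real del.icio.us export.
SAMPLE_EXPORT = <<-HTML
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><A HREF="http://example.com/" LAST_VISIT="1167609600" TAGS="ruby,parsing">With description</A>
<DD>A short note about this bookmark
<DT><A HREF="http://example.org/" LAST_VISIT="1167609600" TAGS="misc">Without description</A>
</DL><p>
HTML

# Count terms and descriptions with a simple case-insensitive scan.
# This is only for illustration; the real parsing is done with Hpricot.
terms        = SAMPLE_EXPORT.scan(/<DT>/i).size
descriptions = SAMPLE_EXPORT.scan(/<DD>/i).size

puts "#{terms} bookmarks, #{descriptions} with a description"
```

Note that the second bookmark has no dd after it, which is exactly the case the parser has to handle.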
But how do you detect whether there's a description? The answer turned out to be rather simple: call term.next, and if the next element's name is dd, we're lucky and have a description. The only problem was that I didn't know how to access the name of an element, until I thought: what if I simply use name? And guess what... it worked! So term.next.name was exactly what I was looking for :-)
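require 'rubygems'
require 'hpricot'

# Parse the exported file into an Hpricot document
doc = open("bookmarks.html") {|f| Hpricot(f) }
bookmarks = []

# Every dt inside the dl is one bookmark
(doc/"dl/dt").each do |term|
  link = (term/"a")

  # A dd right after the term holds the bookmark's description
  if term.next and term.next.name == 'dd'
    desc = term.next.inner_text
  else
    desc = nil
  end

  # Tags are stored as a comma-separated attribute on the link
  if link.attr('tags')
    tags = link.attr('tags').split(",")
  else
    tags = nil
  end

  bookmarks << {
    :address => link.attr('href'),
    # the link's last_visit attribute is stored as the creation time
    :created_at => link.attr('last_visit'),
    :tags => tags,
    :description => desc,
    :title => link.inner_text
  }
end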
Source at supersnippets.
I also extended this a bit to save the results into a database using ActiveRecord, but since each db schema is a different world, I didn't post that version here. If anybody thinks it might be useful, just let me know.
Also, this code is not very rubyesque yet, so suggestions for improving it will be really appreciated. I'm especially thinking about the if ... else parts; I'm pretty sure there's a way to shorten those lines :-)
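As a starting point, here is a rough sketch of one way those two if ... else blocks could be shortened, by moving each into a small helper that returns nil on its own when there is nothing to return. The helper names description_for and tags_from are made up for this sketch, and Node is only a stand-in so the example runs without Hpricot installed; it answers the same name, inner_text and next calls the script uses.

```ruby
# Hypothetical stand-in for an Hpricot element, used only so this sketch
# runs without the gem. It responds to name, inner_text and next.
Node = Struct.new(:name, :inner_text, :next)

# Returns the text of the dd that follows the term, or nil if there is none.
# A statement modifier evaluates to nil when its condition is false,
# so no explicit else branch is needed.
def description_for(term)
  following = term.next
  following.inner_text if following && following.name == 'dd'
end

# Splits a comma-separated tags value, or returns nil when it is missing.
def tags_from(raw)
  raw && raw.split(',')
end

# Small demonstration with stand-in nodes
dd   = Node.new('dd', 'A short note', nil)
term = Node.new('dt', 'Some bookmark', dd)
lone = Node.new('dt', 'Another bookmark', nil)

p description_for(term)
p description_for(lone)
p tags_from('ruby,parsing')
p tags_from(nil)
```

Inside the loop, the two blocks would then become desc = description_for(term) and tags = tags_from(link.attr('tags')), assuming the Hpricot objects behave as they do in the script above.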