Posts Tagged ‘xpath’

20090710 “Blue Tuesday” sources released

Blue Tuesday by xplsv

“Blue Tuesday” is a direct evolution of the codecolors code base. Since it was all done in a hurry, there were lots of things which didn’t work as expected. I got rid of some of them when I ported the demo to Mac, and of a few more these days when I made the demo work on Linux, prior to releasing these sources. I am not happy with this code; after so many modifications it has grown too much to still be readable. There are effects whose behaviour is not predictable, things like the ribbons are very inefficient, and so on.

I’m still not completely happy with the synchronization. For the first version, trace made a simple Flash app in which he tapped a key each time an event happened (e.g. a snare hit) and then generated a list from there. But due to the way the demo is structured, I had to split the list into three parts so that I could check whether an event happened in each effect. Also, with so many changes in the code, the synch had gone slightly awry, and when I managed to compile the code on Linux it was clearly asking for a revision.

So I thought about using Audacity for re-recording the synchronization. I would store each event as a label (Audacity has something called Label Tracks), then export those labels as txt files and process them into a .h file with all the events in an array. Then I had the idea of opening Audacity’s project file with a text editor and found out that it was a simple XML file, which would make importing the labels’ information even easier: no need to export to a text file and parse that.
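The label-to-header conversion described above can be sketched in a few lines. This is only an illustration in Ruby, not the script used for the demo; it assumes the tab-separated layout of an exported label track (start time, end time and label text on each line), and the names labels_to_header, SAMPLE_LABELS and sync_events are made up for the example.

```ruby
# Hypothetical sketch: turn an exported label track into a C header.
# Assumes every non-empty line looks like "start<TAB>end<TAB>label".
def labels_to_header(label_text, array_name = "sync_events")
  # Keep only the start time of every label, in seconds, in ascending order.
  times = label_text.each_line.map(&:strip).reject(&:empty?).map do |line|
    start_time, _end_time, _title = line.split("\t")
    Float(start_time)
  end.sort

  # One array entry per line, written as a C float literal.
  body = times.map { |t| format("    %.3ff", t) }.join(",\n")

  <<~HEADER
    /* Generated file, do not edit by hand. */
    #define #{array_name.upcase}_COUNT #{times.length}
    static const float #{array_name}[] = {
    #{body}
    };
  HEADER
end

# Made-up sample data standing in for a real export.
SAMPLE_LABELS = "0.500000\t0.500000\tsnare\n1.250000\t1.250000\tsnare\n"

puts labels_to_header(SAMPLE_LABELS)
```

Keeping only the start time is a choice made for this sketch; a demo that needs to tell events apart would also keep the label text.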

I then tried to process the XML file with Python, but it was a horrible experience. Since the file declares a namespace and I wanted to do a simple XPath search, it seems I was doomed to fail without installing a couple of Python libraries, which I didn’t want to do. My main premise is to make my demos simple to compile, and having to install an extension just for parsing XML is not what I consider “simple”. So I resorted to good old PHP’s XML functions and in 20 minutes the import script was done. That was the easy part.
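For comparison, this is roughly what a namespace-aware XPath search looks like in Ruby with REXML. It is not the PHP import script, which isn’t reproduced here. The sample document is hand-written: the namespace URI, the labeltrack/label layout and the t and title attributes are assumptions about what an Audacity project looks like, and label_times is a made-up name.

```ruby
require 'rexml/document'

# Hand-written stand-in for a project file. The namespace URI and the
# element and attribute names are assumptions made for this example.
SAMPLE_PROJECT = <<~XML
  <project xmlns="http://audacity.sourceforge.net/xml/">
    <labeltrack name="snare" numlabels="2">
      <label t="0.5" t1="0.5" title="hit"/>
      <label t="1.25" t1="1.25" title="hit"/>
    </labeltrack>
  </project>
XML

# Return the start times of all labels in the label track called track_name.
def label_times(xml, track_name)
  doc = REXML::Document.new(xml)
  # Bind the document's default namespace to a prefix; without this the
  # XPath expression would not match the namespaced elements.
  namespaces = { "a" => "http://audacity.sourceforge.net/xml/" }
  path = "//a:labeltrack[@name='#{track_name}']/a:label"
  REXML::XPath.match(doc, path, namespaces).map do |label|
    label.attributes["t"].to_f
  end
end

p label_times(SAMPLE_PROJECT, "snare")
```

The whole trick is the prefix mapping passed as the third argument: the prefix "a" is arbitrary, only the URI has to match the one declared in the file.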

What was really painful was recording the synch points! Audacity took up hanging as a hobby. Thankfully, I have the CTRL+S tic (whenever I’m working with any program I tend to press CTRL+S regularly, pretty much each time I have typed something in, just in case) and that saved me from losing my changes more than once. But it was just annoying to have to kill the program, open it again and say yes to “yes please recover the project I was working on when you decided to crash”. Even worse: at some point the project got corrupted, and when I reopened it Audacity just got stuck switching between a sleep and an idle status. I had to create a new project with the same mp3, save it, then edit the new project in a text editor and copy and paste the old label tracks from the initial project. Luckily it was all XML. I can’t imagine what I would have done if it had been binary data! (Probably scream a good lot!)

So, lesson learnt: for making good synchronization you need a dedicated, and probably integrated, piece of software. If I need to do something like this again, I will probably spend a good amount of time preparing something like the editor blackpawn created. Because it wasn’t only a problem of crashing; it was also the problem of not having a comfortable interface which lets you go back and forth through the song, reduce its speed, etc., without having to use the mouse.

This is probably one of the reasons why I prefer to make demos with my own music: it’s easier to access the original song source files and build a list of events from there if I wanted to :-)

Enough chattering, go watch the demo or get a nice headache just by looking at the code :P

20070615 Extracting data with Hpricot

For those (few) of you who haven’t heard about it, Hpricot is a nice library for parsing HTML in Ruby, created by the even nicer _whytheluckystiff, author of Why’s (Poignant) Guide to Ruby, Camping and other Ruby gems (may you excuse the pun? It was impossible to avoid).

Ever since I saw a demonstration by Rob McKinnon at a certain LRUG meeting, I have been wanting to try Hpricot, but I hadn’t found an application for it yet. No more! I found myself today wanting to extract data from a table in a web page and suddenly I thought: this is a job for Hpricot! More specifically, I wanted to extract these EXIF tags, and I simply couldn’t accept the mere thought of entering that data manually. It needed to be automated!

Getting it

Getting Hpricot is very easy:

sudo gem install hpricot

(if you’re picky you can try the more exotic ways of installing it listed on its homepage), or just

gem install hpricot

if you’re on Windows, of course.

Understanding it is easy as well, especially if you have used jQuery before. It’s all about writing selectors for finding things, so it helps a lot if the HTML document is well marked up. Otherwise, you might end up writing lots of workarounds or extra code that could be avoided simply by having a class or id specified in the relevant elements.

Inspecting & traversing

So, once I got the library installed, I took a look at the page source code with Firebug. It is especially useful for this kind of job because it helps you visualize the hierarchy of elements in the page, including classes and ids, so you don’t have to traverse the HTML tree manually to gather the data you need.

What I was looking for was the table which contained the relevant data. In this case we’re lucky: even if the table hasn’t got an id attribute which would make it uniquely identifiable in the whole document, it still has class="inner", which happens to be used only once in it, thus effectively acting as an element identifier.

Firebug in action!

Note how Firebug is showing the tree path for the selected table. If we didn’t have the class attribute, we would need to use a selector like "/html/body/blockquote/table/tbody/tr/td/table", but thanks to it the selector will be something as simple as "/table.inner".

Hands on Ruby

Ok, so this is where we write a few lines of code which do a lot ;-)

First come the usual series of requires:

require 'rubygems'
require 'hpricot'
require 'open-uri'

Rubygems is required in order to load hpricot, and open-uri is required in order to read data directly from a URI. open-uri comes with Ruby, so we don’t need to install anything else.

Now we need to get the HTML file. It is as simple as

doc = Hpricot(open("http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html"))

but since I was doing lots of tests and didn’t want to overload that guy’s server, I simply saved the document as EXIF.html and loaded it with this instead:

doc = open("EXIF.html") { |f| Hpricot(f) }

At this point we have the HTML document in the doc variable, so what are we waiting for?
We initialize a rows variable for holding the data that we’ll extract:

rows = []

And now comes the real fun!

(doc/"table.inner//tr").each do |row|
    cells = []
    (row/"td").each do |cell|
        # Cells holding a note wrap it in <span class="s">
        if (cell/"span.s").length > 0
            # Split the note on line breaks, then each chunk into a key/value pair
            values = (cell/"span.s").inner_html.split('<br />').collect { |str|
                pair = str.strip.split('=').collect { |val| val.strip }
                Hash[pair[0], pair[1]]
            }

            if values.length == 1
                # A single chunk is a plain note, not a list of constants
                cells << cell.inner_text
            else
                cells << values
            end
        else
            cells << cell.inner_text
        end
    end
    rows << cells
end

Ok, not that fast. I’ll elaborate a little more on the juicy bits.

(doc/"table.inner//tr").each do |row|

This is the key for reaching the main data. It’s like saying: I’m looking in doc for all the rows (the tr’s) which are contained in a table whose class equals ‘inner’. Inside the selector, a single / means we want an immediate child, while // means a descendant at any depth below the element. As I said before, it’s all about selecting and traversing the tree.
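The child versus descendant distinction is the same one plain XPath makes, so it can be tried out without Hpricot. The sketch below swaps in REXML (not Hpricot) and a hand-written table; count_matches and TABLE are names made up for the example.

```ruby
require 'rexml/document'

# Count the elements an XPath expression matches in a small XML snippet.
def count_matches(xml, path)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, path).length
end

# Hand-written markup: the rows sit inside a tbody, not directly in the table.
TABLE = '<table class="inner"><tbody>' \
        '<tr><td>a</td></tr><tr><td>b</td></tr>' \
        '</tbody></table>'

puts count_matches(TABLE, "/table/tr")   # immediate children of the table only
puts count_matches(TABLE, "/table//tr")  # rows at any depth below the table
```

With the tbody in between, the first expression finds no rows and the second finds both, which is why the // form is the safer choice when the exact nesting isn’t known.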

That line hands each matching tr to the block in the row variable. We can continue extracting data from within row, and that’s exactly what we do with

(row/"td").each do |cell|

That one provides us with all the td elements immediately below the current row.

When we reach the td elements, all that is left is to extract the data for each cell and push it into the cells array, which will be pushed into the rows array. But we don’t just copy the cell data as it is; some cells contain notes, and some of those notes contain lists of values. I think we can all agree that those lists of values are commonly called Hashes, and they undoubtedly deserve a special treatment!

if (cell/"span.s").length > 0

So that’s why I’m checking for the existence of a span with class == s inside each cell. If we find one, there’s a note in this row, and there’s probably a hash with values. I would say this is the most fun part of all:

values = (cell/"span.s").inner_html.split('<br />').collect{ |str|
  pair = str.strip.split('=').collect{|val| val.strip}
  Hash[pair[0], pair[1]]
}

I’m making use of the fact that each invoked function returns another object, so that I can chain them consecutively instead of doing a series of assignments. And it reads like this: take the html inside the span with class s, split it where you find a br, and for each of those split parts remove the surrounding whitespace and split it again where you find a =, so we get a key-value pair; remove the whitespace from both halves of the pair as well and put them in a new Hash.
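To see the chain at work without Hpricot, the same calls can be run on a plain string. The string below is a made-up stand-in for what the span’s inner_html might contain, and parse_values and NOTE are names invented for this sketch.

```ruby
# Same chain as above, applied to a plain string instead of an Hpricot result.
def parse_values(inner_html)
  inner_html.split('<br />').collect { |str|
    # "1 = Horizontal " becomes ["1", "Horizontal"]
    pair = str.strip.split('=').collect { |val| val.strip }
    Hash[pair[0], pair[1]]
  }
end

# Illustrative note text, not copied from the real page.
NOTE = "1 = Horizontal <br />\n 3 = Rotate 180 <br /> 6 = Rotate 90 CW"

values = parse_values(NOTE)
p values.length  # one Hash per chunk
p values.first   # the pair taken from "1 = Horizontal"

# If a single Hash is more convenient, the pairs can be merged afterwards.
p values.inject({}) { |all, pair| all.merge(pair) }
```

Note that the result is an array of one-pair Hashes rather than one big Hash, and that both keys and values stay strings.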

At the end we finish with an array of rows and cells, where certain cells occasionally contain a Hash with the constants used by the row EXIF tag.

It’s also interesting to note that the first row is unusable, because it corresponds to the th elements, so we’ll simply do a

rows.shift

and it’s gone. And to top it all, we could output the rows array to a YAML file, so that we do not need to run this each time we need the list of EXIF tags.

Arrays in Ruby have a lovely method called to_yaml (available once the yaml library has been loaded with require 'yaml') which dutifully generates a version of the array in YAML syntax. And it’s very easy to output that to a file:

File.open('hexif.yaml', 'w') { |f|
  f << rows.to_yaml
}
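Reading the data back later is just as short. The following is a minimal sketch of the round trip, assuming rows only contains arrays, hashes and strings; the sample rows are invented for the example, and save_rows and load_rows are made-up helper names.

```ruby
require 'yaml'
require 'tmpdir'

# Write the rows array to a YAML file, as in the snippet above.
def save_rows(rows, path)
  File.open(path, 'w') { |f| f << rows.to_yaml }
end

# Read the rows back without touching the web page again.
def load_rows(path)
  YAML.load_file(path)
end

# Invented rows in the shape the script builds: plain cells are strings and
# a cell with several constants holds an array of one-pair hashes.
rows = [
  ["0x0112", "Orientation", [{ "1" => "Horizontal" }, { "3" => "Rotate 180" }]],
  ["0x010f", "Make", ""]
]

# A temporary location is used here so the example leaves hexif.yaml alone.
path = File.join(Dir.tmpdir, "hexif_example.yaml")
save_rows(rows, path)
puts load_rows(path) == rows
```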

And you’re done! I hope you liked this small Hpricot tutorial/introduction… and if you have any suggestion or improvement please let me know!

Of course, you can get the complete source code here: hexif.rb. It is a ridiculous 61 lines, including some commented lines and white space. Come on, get it and do something cool!