soledad penadés
repeat 4[fd 100 rt 90]

Archive for the ‘ruby’ Category

20080325 Parsing a del.icio.us export with Hpricot

The trickiest part is detecting whether a bookmark has a corresponding description. The export is in the same format that Netscape used for its bookmarks export, which means it is a simple HTML file with a definition list (dl) and a series of definition terms (dt). A term (a bookmark) may be followed by a description (dd).

But how do you detect if there's a description? It seems the answer was rather simple: use term.next and if the next element's name is dd, we're lucky and have a description. The only problem was that I didn't know how to access the name of an element, until I just thought: what if I simply use name? and guess what… it worked! So term.next.name was exactly what I looked for :-)

require 'rubygems'
require 'hpricot'

doc = open("delicious.html") {|f| Hpricot(f) }

bookmarks = []

(doc/"dl/dt").each do |term|
        link = (term/"a")

        if term.next and term.next.name == 'dd'
                desc = term.next.inner_text
        else
                desc = nil
        end

        if link.attr('tags')
                tags = link.attr('tags').split(",")
        else
                tags = nil
        end

        bookmarks << {
                :address     => link.attr('href'),
                :created_at  => link.attr('last_visit'),
                :tags        => tags,
                :description => desc,
                :title       => link.inner_text
        }
end

Source at supersnippets.

I also extended this a bit to save the results into a database, using ActiveRecord, but since each db schema is a different world, I didn't post that version here. If anybody thinks it might be useful just let me know.

Also, this code is not very rubyesque yet; suggestions to improve it would be really appreciated. I'm especially thinking about the if … else parts, I'm pretty sure there's a way to shorten those lines :-)
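One possible way to shorten those if … else parts, sketched with plain Ruby stand-ins rather than real Hpricot nodes. The FakeNode struct and the description_for and tags_for helpers are made up for this illustration; the idea is only that an if used as a modifier yields nil when the condition fails, and that && returns its last operand when everything before it is truthy.

```ruby
# Stand-in for an Hpricot element: only the two methods the loop relies on.
# (Hypothetical, for illustration; not part of Hpricot.)
FakeNode = Struct.new(:name, :inner_text)

# Returns the text of the following node when it is a <dd>, nil otherwise.
def description_for(next_node)
  next_node.inner_text if next_node && next_node.name == 'dd'
end

# Returns the tags as an array, or nil when there is no tags attribute.
def tags_for(raw_tags)
  raw_tags && raw_tags.split(',')
end

p description_for(FakeNode.new('dd', 'A nice site'))
p description_for(FakeNode.new('dt', 'Another bookmark'))
p description_for(nil)
p tags_for('ruby,hpricot')
p tags_for(nil)
```

In the real loop the same two expressions would presumably take term.next and link.attr('tags') as their arguments.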

20071005 Removing elements with Hpricot

Something like a month ago, a guy asked me how to remove elements with Hpricot. I told him I would look into it but it's been a month already! So I hope I can compensate for the delay with this minitutorial on removing stuff with Hpricot! :-)

First I created a simple test page. It's got some html elements, some have id's, some contain certain text nodes. It looks like this:

<p>This is a paragraph without attributes</p>
        <p id="bad_attribute">This is a paragraph with one attribute: id=bad_attribute</p>
        <ul>
                <li>Element 1</li>
                <li>This will be removed because the text doesn't begin with an E</li>
        </ul>
        <ul id="second_list" style="border:1px solid red;">
                <li>Element 1 in the list with id=second_list</li>
                <li>element 2</li>
        </ul>

The question was how to remove certain individual elements given certain conditions - more specifically, when the element attributes matched a condition. I don't see why he had problems removing stuff with the remove method, since that's what I have used. Since search returns a collection of elements, you just need to get a collection which contains only the element you want to remove, and then apply remove to that collection.

Here are three examples:

Removing the paragraph with id = bad_attribute

We find the element using CSS selectors, where the hash means 'id'.

doc.search("p#bad_attribute").remove

Removing all the unordered lists (ul's) which have a style attribute

Again, using CSS selectors:

doc.search("ul[@style]").remove

There's more info about CSS selectors in the Hpricot CSS search documentation. One can get very creative with this; it allows for filtering almost anything!

Removing elements whose contents match certain conditions

When CSS selectors are not enough, we can take advantage of Ruby!

For example, if you want to remove list items (li's) whose text doesn't begin with E, you could do it with this:

doc.search("li").collect!{|node|
        node if not /^E/.match(node.inner_text)
}.compact.remove

which is the same as saying:

  • Look for every list item in the document
  • Take the results of that search (which is an Array of Hpricot Elements) and apply the collect! function to them
  • collect! executes the code in the block for each element and stores the return value in an array
  • But as the block returns nil for the items we want to keep (those whose inner_text does begin with 'E' and hence matches our little regular expression), we remove the nil values from the array with compact, so that we don't get errors when removing.
  • And finally, remove the elements which are in the resulting array, with the classical Hpricot remove

Note how I used collect! instead of just collect, so that the changes are applied to the search results themselves, and we don't get a new array instead.

You should try using collect instead of collect!, and removing compact from the chain, to see what happens.
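To see the filtering logic on its own, here is a small sketch that uses plain strings in place of the list items' inner_text (the ITEMS list and the two helper names are made up for this illustration). It also shows reject, which expresses the same filter without producing nils. Whether the result of reject still responds to remove depends on the Ruby and Hpricot versions in use, so treat this only as an illustration of which items get picked.

```ruby
# Plain strings standing in for the inner_text of each <li>.
ITEMS = [
  'Element 1',
  'This will be removed',
  'Element 1 in the list with id=second_list',
  'element 2'
]

# Same shape as the chain above: return the item when it should be removed,
# nil otherwise, then drop the nils.
def to_remove_with_collect(texts)
  texts.collect { |text| text if not /^E/.match(text) }.compact
end

# reject drops the items that match, so no nils appear in the first place.
def to_remove_with_reject(texts)
  texts.reject { |text| text =~ /^E/ }
end

p to_remove_with_collect(ITEMS)
p to_remove_with_reject(ITEMS)
```

Both calls pick the same two items: the sentence that doesn't start with E, and the one starting with a lowercase e.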

Final result

If one applies all these evil removals to the original code, the final result is this:

<p>This is a paragraph without attributes</p>
               
        <ul>
                <li>Element 1</li>
               
        </ul>

Pretty empty, isn't it?!

Download these examples

I've uploaded the hpricot_remove_elements.rb and test.html together in a zip file: hpricot_remove_elements.zip. To run it, just unpack, and type ruby hpricot_remove_elements.rb

Or open it with TextMate and press Option+R ;-)

20070627 Superminigallery: a gallery with ruby, rmagick and builder

Imagine you have been on holiday in a very nice place. You took a lot of pictures and want to show them to your family and friends, but you don't feel like using services like flickr or programs like iPhoto. You just want to put them on your own server and give the url to your friends.

What can you do? Well, you could do like me and create a little script to generate an HTML file, with thumbnails and even watermarked images (just in case some creepy individual decides to use your stuff without asking first).

Superminigallery thumbnail

Requirements

This script requires a couple of gems to be installed: RMagick and builder (but if you've done some stuff with Rails you might already have them). RMagick is used for dealing with the images and builder is used for generating the XHTML. This is because I didn't want to write any HTML by hand, with its less-than and greater-than signs, attributes, etc.

Using it

  1. Create a folder in your computer. For example: holidays.
  2. Then you copy there the pictures you want to show to the world.
  3. Open a terminal and cd to that directory. E.g.
    cd ~/Desktop/holidays
  4. Execute the script! E.g.
    ruby ~/code/superminigallery.rb
  5. Wait until it finishes

When it finishes you'll find there's an output folder in the holidays folder. That's where the index.html file, as well as all the thumbnails and watermarked images are. Simply upload the contents of this folder to your host and let everybody know about it!

Ok, but show me the code

The first lines act like a configuration area. You can change the output folder name, so that it is called superoutput, gallery, whatever you like (as long as it is a valid path name).

You may change the sizes of the generated pictures; these sizes are defined in the versions variable. Each pair means [width, height]. For example, the thumbnails are 300 pixels wide and 150 pixels high.

output_path = 'output'

versions = {
  'thumbnail' =>  [300,150],
  'big'       =>  [1024,768]
}

You can also configure which EXIF tags need to be retrieved. Since their names are too obscure for people who aren't technically savvy, I decided to create this hash for storing the nice name to show (the key) and the Exif tag it corresponds to (the value). So instead of showing DateTimeOriginal, it will simply output Taken.

exif_fields = {
  'Taken'     =>  'DateTimeOriginal',
  'Camera'    =>  'Model',
  'Exposure'  =>  'ExposureTime', 
  'Shutter Speed' =>  'ShutterSpeedValue'
}

There are way more tags you could show, but they can be confusing for normal people and only entertain geeks, so it's better to keep them down to a minimum.
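As a rough sketch of how such a hash can be used, the snippet below walks a shortened copy of it against a plain Hash that stands in for an image's properties. The SAMPLE_PROPERTIES values and the labelled_metadata helper are invented for this illustration; only the "Exif:" key prefix is taken from the script further down.

```ruby
# Nice name => Exif tag, same shape as exif_fields above (shortened).
EXIF_FIELDS = {
  'Taken'  => 'DateTimeOriginal',
  'Camera' => 'Model'
}

# Hypothetical metadata for one picture; this one has no Model tag.
SAMPLE_PROPERTIES = {
  'Exif:DateTimeOriginal' => '2007:06:20 10:15:00'
}

# Returns [nice name, value] pairs, skipping tags the picture doesn't have.
def labelled_metadata(fields, properties)
  result = []
  fields.each do |title, tag|
    value = properties["Exif:#{tag}"]
    result << [title, value] unless value.nil?
  end
  result
end

p labelled_metadata(EXIF_FIELDS, SAMPLE_PROPERTIES)
```

Tags missing from a picture are skipped, so only the fields that actually exist end up in the output.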

Declare the builder object, and initialize it with the XHTML header.

xhtml = ''   # the string that will receive the generated markup
x = Builder::XmlMarkup.new(:target=>xhtml, :indent=>1)

x.instruct!
x.declare! :DOCTYPE, :html, :PUBLIC, "-//W3C//DTD XHTML 1.0 Strict//EN", "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
x.html( "xmlns" => "http://www.w3.org/1999/xhtml" ) {

Now it would be amazing to have some styling in the page so that it doesn't look so ugly. We can put a style tag inside the head, and use the text! method for adding literal text to the builder object:

x.style( "type"=>"text/css" ) { x.text! "
      body{
        font-family:georgia,serif
      }
     
      h1,h2 {
        margin-top: 0;
      }
      …
      "
}

(the … means there's more code but I have reduced it for clarity purposes)

Now, we need to create the output directory. I haven't bothered with outputting error messages if the directory already exists or anything. It will always try to create it:

begin
  FileUtils.mkdir output_path
rescue
  # the directory probably exists already; carry on
end
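A possible alternative, not what the script does: FileUtils.mkdir_p from the standard library creates the directory together with any missing parents and does nothing when it already exists, so the empty rescue isn't needed. The ensure_output_dir helper name and the temporary directory are assumptions made just for this sketch.

```ruby
require 'fileutils'
require 'tmpdir'

# Creates the directory (and any missing parents); a no-op if it exists.
# Returns true when the directory is there afterwards.
def ensure_output_dir(path)
  FileUtils.mkdir_p(path)
  File.directory?(path)
end

# Try it inside a throwaway temporary directory.
Dir.mktmpdir do |tmp|
  target = File.join(tmp, 'output')
  puts ensure_output_dir(target)   # first call creates it
  puts ensure_output_dir(target)   # second call raises nothing
end
```

Unlike a bare rescue, this still lets other failures (such as missing permissions) surface as exceptions.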

We need to create a Magick::Draw object for watermarking the images, and define its parameters:

draw = Magick::Draw.new
draw.gravity = Magick::CenterGravity
draw.pointsize = 64
draw.font_family = "Helvetica"
draw.font_weight = Magick::BoldWeight
draw.stroke = 'none'
draw.fill = "#ffffff99"

Basically we are saying: use Helvetica bold 64pt, painting it with white (ffffff) and some transparency (99 for alpha channel). If you don't have Helvetica installed in your system, replace it with your favourite font.

(But since 2007 is Helvetica's 50th anniversary, you should do everything possible to use Helvetica!)

Now we open the current directory (where the script was executed), find all files with jpg and JPG extensions, and sort them. That's because sometimes the images don't get listed in alphabetical order, and we humans like to see things in sequential order. Especially because they are usually numbered incrementally, and lower numbers mean older images, so IMG001 should appear before IMG100.

Dir['*.jpg','*.JPG'].sort.each do |f|

Read each file into a Magick::Image object:

img = Magick::Image.read(f).first

And for each version…

versions.each do |k,v|

… create the version filename by appending the version name to the filename, like big_IMG_1234.jpg, and the output filename, by prepending the output path to the version filename:

version_file =  k + '_' + f
output_img_path = File.join(output_path, version_file)
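The same two steps, wrapped in a small helper so they can be tried in isolation (the version_paths name is made up for this sketch):

```ruby
# Builds the version filename and its full output path.
def version_paths(output_path, version_name, filename)
  version_file = version_name + '_' + filename
  [version_file, File.join(output_path, version_file)]
end

p version_paths('output', 'big', 'IMG_1234.jpg')
# → ["big_IMG_1234.jpg", "output/big_IMG_1234.jpg"]
```

File.join takes care of the separator, so it doesn't matter whether output_path ends with a slash or not.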

If the version is 'thumbnail', we'll add the image metadata to the builder object. Note how you don't need to open or close tags, but just include things in blocks or parentheses to get the markup done.

if(k=='thumbnail')
        x.div('class'=>'picture') {
                x.h2(f)
                x.a('href'=> version_file.sub('thumbnail_', 'big_')) {
                        x.img('src'=>version_file)
                }
                x.dl {
                        x.dt('Dimensions')
                        x.dd(img.columns.to_s + ' x ' + img.rows.to_s)

                        exif_fields.each do |title, field|
                                key = "Exif:#{field}"
                                if img[key]!=nil
                                        x.dt(title)
                                        x.dd(img[key])
                                end
                        end
                }
        }
end

Resizing the image is as simple as

version = img.crop_resized(v[0], v[1])

crop_resized returns another Image object which we store in the version variable.

Now, if we are dealing with the 'big' version, we'll add the watermark that we prepared at the beginning. That is done with

if(k=='big')
        draw.annotate(version, 0, 0, 0, 0, "(c) soledadpenades.com")
end

You can replace my (c) soledadpenades.com with your text, of course!

And for writing the resulting image to disk:

version.write output_img_path

Very very important: do not forget to call the Garbage Collector. For some reason which I still haven't been able to elucidate, the RMagick gem leaks memory furiously. So if you forget to do a

GC.start

as I did with the first version of the script, your computer will most likely hang if you make it generate a lot of thumbnails. If you look at the current processes with top or a similar tool, you'll find a ruby process eating more and more memory with each picture it processes.

And finally, we just need to output the generated XHTML to index.html:

File.open(File.join(output_path, 'index.html'), 'w+') do |file|
  file.puts xhtml
end

Here's the result and here's the source code. With only 120 lines of code (excluding the license text :D), it's way easy to modify to suit your tastes.

Don't tell anyone but…

I must confess I got the inspiration for this from herotyc's jGallery. But he used a bash script and I thought there should be a way of doing the same with ruby :-)

20070615 Extracting data with Hpricot

For those (few) of you who haven't heard about it, Hpricot is a nice library for parsing HTML in ruby, created by the even nicer _whytheluckystiff, author of Why's (Poignant) Guide to Ruby, Camping and other ruby gems (may you excuse the pun? it was impossible to avoid it).

Ever since I saw a demonstration by Rob McKinnon at a certain LRUG meeting, I have been wanting to try Hpricot, but I hadn't found an application for it yet. No more! I found myself today wanting to extract data from a table in a web page and suddenly I thought: this is a job for Hpricot! More specifically, I wanted to extract these EXIF tags, and I simply couldn't accept the mere thought of entering that data manually. It needed to be automated!

Getting it

Getting Hpricot is very easy:

sudo gem install hpricot

(if you're picky you can try the more exotic ways of installing described on its homepage). Or simply

gem install hpricot

if you're on Windows, of course.

Understanding it is easy as well, especially if you have used jQuery before. It's all about writing selectors for looking for things, so it helps a lot if the HTML document is well marked up. Otherwise, you might end up doing lots of workarounds or writing extra code that could be avoided simply by having a class or id specified in the relevant elements.

Inspecting & traversing

So, once I got the library installed, I took a look at the page source code with Firebug. It is especially useful for this kind of job because it helps you to visualize the hierarchy of elements in the page, including classes and id's, so you don't have to traverse the HTML tree manually to gather the data you need.

What I was looking for was the table which contained the relevant data. In this case, we're lucky and even if the table hasn't got an id attribute which would make it uniquely identifiable in the whole document, it still has class="inner", which happens to be used only once in it, thus acting effectively as an element identifier.

Firebug in action!

Note how Firebug is showing the tree path for the selected table. If we didn't have the class attribute, we would need to use a selector like "/html/body/blockquote/table/tbody/tr/td/table", but thanks to it the selector can be something as simple as "/table.inner".

Hands on Ruby

Ok, so this is where we write a few lines of code which do a lot ;-)

First come the usual series of requires:

require 'rubygems'
require 'hpricot'
require 'open-uri'

Rubygems is required in order to load hpricot, and open-uri is required in order to directly read data from a URI. open-uri comes with ruby, so we don't need to install anything else.

Now we need to get the HTML file. It is as simple as

doc = Hpricot(open("http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html"))

but since I was doing lots of tests and didn't want to overload that guy's server, I simply saved the document as EXIF.html and loaded it with this instead:

doc = open("EXIF.html") { |f| Hpricot(f) }

At this point we have the HTML document in the doc variable, so what are we waiting for?
We initialize a rows variable for holding the data that we'll extract:

rows = []

And now comes the real fun!

(doc/"table.inner//tr").each do |row|
    cells = []
    (row/"td").each do |cell|

        if (cell/" span.s").length > 0
            # The cell has a note: split it into key = value pairs
            values = (cell/"span.s").inner_html.split('<br />').collect{ |str|
              pair = str.strip.split('=').collect{|val| val.strip}
              Hash[pair[0], pair[1]]
            }

            if(values.length==1)
              cells << cell.inner_text
            else
              cells << values
            end

        else
            # Plain cell: keep its text as it is
            cells << cell.inner_text
        end
    end
    rows << cells
end

Ok, not that fast. I'll elaborate a little more on the juicy bits.

(doc/"table.inner//tr").each do |row|

This is the key to reaching the main data. It's like saying: I'm looking in doc for all the rows (the tr's) which are contained in a table whose class equals 'inner'. When we use a / it means we want an immediate child; // means a descendant at any depth below the element. As I said before, it's all about selecting and traversing the tree.

With that line of code, each tr is handed to the block in the row variable. We can continue extracting data from within row, and that's exactly what we do with

(row/"td").each do |cell|

That one provides us with all the td elements immediately below the current row.

When we reach the td elements, all that is left is to extract the data for each cell and push it into the cells array, which will be pushed into the rows array. But we don't just copy the cell data as it is; some cells contain notes, and some of those notes contain lists of values. I think we can all agree that those lists of values are commonly called Hashes, and they undoubtedly deserve special treatment!

if (cell/" span.s").length > 0

So that's why I'm checking for the existence of a span with class == s inside each cell. If we find one, there's a note in this row, and there's probably a hash with values. I would say this is the most fun part of all:

values = (cell/"span.s").inner_html.split('<br />').collect{ |str|
  pair = str.strip.split('=').collect{|val| val.strip}
  Hash[pair[0], pair[1]]
}

I'm making use of the fact that each invoked function returns another object, so that I can chain them consecutively instead of doing a series of assignments. And it reads like this: take the html inside the span with class s, split it where you find a br, and for each of those split parts remove the surrounding whitespace and split it again where you find a =, so we get a key/value pair; remove the whitespace from both halves of the pair as well and put them in a new Hash.
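The chain can be tried without Hpricot by feeding it a plain string. The SAMPLE_NOTE text below is a made-up stand-in for what inner_html might return, and parse_note and merged_note are helper names invented for this sketch:

```ruby
# Made-up sample of what the span's inner_html might look like.
SAMPLE_NOTE = "1 = Horizontal<br />\n2 = Mirror horizontal<br />\n3 = Rotate 180"

# Same chain as above: one single-pair Hash per line of the note.
def parse_note(html)
  html.split('<br />').collect { |str|
    pair = str.strip.split('=').collect { |val| val.strip }
    Hash[pair[0], pair[1]]
  }
end

# Optional extra step: fold the single-pair hashes into one Hash.
def merged_note(html)
  parse_note(html).inject({}) { |all, one| all.merge(one) }
end

p parse_note(SAMPLE_NOTE)
p merged_note(SAMPLE_NOTE)
```

Note that the chain yields an array of one-pair hashes rather than a single Hash; the inject step is one way to merge them if a single lookup table is more convenient.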

At the end we finish with an array of rows and cells, where certain cells occasionally contain a Hash with the constants used by the row EXIF tag.

It's also interesting to note that the first row is unusable, because it corresponds to the th elements, so we'll simply do a

rows.shift

and it's gone. And to top it all, we could output the rows array to a yaml file, so that we do not need to run this each time we need the list of EXIF tags.

Arrays in ruby have a lovely method called to_yaml (available once yaml has been required) which dutifully generates a version of the array in yaml syntax. And it's very easy to output that to a file:

File.open('hexif.yaml', 'w') { |f|
  f << rows.to_yaml
}
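Reading the file back is just as short. The sketch below writes a couple of hypothetical rows (SAMPLE_ROWS is invented, but shaped like the rows built above) to a temporary file and loads them again with YAML.load_file from the standard library; save_rows and load_rows are names made up for the example.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical rows in the same shape the tutorial builds:
# plain text cells, plus an array of small hashes for cells with notes.
SAMPLE_ROWS = [
  ['Orientation', 'int16u', [{ '1' => 'Horizontal' }, { '3' => 'Rotate 180' }]],
  ['Model', 'string', '']
]

# Writes the rows as YAML, the same way as the snippet above.
def save_rows(path, rows)
  File.open(path, 'w') { |f| f << rows.to_yaml }
end

# Loads them back into plain arrays, hashes and strings.
def load_rows(path)
  YAML.load_file(path)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'hexif.yaml')
  save_rows(path, SAMPLE_ROWS)
  p load_rows(path) == SAMPLE_ROWS
end
```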

And you're done! I hope you liked this small Hpricot tutorial/introduction… and if you have any suggestion or improvement please let me know!

Of course, you can get the complete source code here: hexif.rb. It is a ridiculous 61 lines, including some commented lines and white space. Come on, get it and do something cool!