Removing elements with Hpricot

About a month ago, someone asked me how to remove elements with Hpricot. I told him I would look into it, but a month has already gone by! So I hope this mini-tutorial on removing stuff with Hpricot makes up for the delay! :-)

First I created a simple test page. It has a few HTML elements: some have ids, and some contain particular text. It looks like this:


    <p>This is a paragraph without attributes</p>
    <p id="bad_attribute">This is a paragraph with one attribute: id=bad_attribute</p>
    <ul>
        <li>Element 1</li>
        <li>This will be removed because the text doesn't begin with an E</li>
    </ul>
    <ul id="second_list" style="border:1px solid red;">
        <li>Element 1 in the list with id=second_list</li>
        <li>element 2</li>
    </ul>

The question was how to remove individual elements under certain conditions, more specifically when the element's attributes match a condition. I don't see why he had problems removing stuff with the remove method, since that's what I have used. Since search returns a collection of elements, you just need to get a collection which contains only the elements you want to remove, and then call remove on that collection.

Here are three examples:

Removing the paragraph with id = bad_attribute

We find the element using a CSS selector, where the hash means 'id'.

    # doc is the parsed test page, e.g. doc = Hpricot(File.read("test.html"))
    doc.search("p#bad_attribute").remove

Removing all the unordered lists (ul's) which have a style attribute

Again, using CSS selectors:


    doc.search("ul[@style]").remove

There's more info about CSS selectors in the Hpricot CSS search documentation. You can get very creative with them, and they allow filtering on almost anything!

Removing elements whose contents match certain conditions

When CSS selectors aren't enough, we can take full advantage of Ruby!

For example, if you want to remove list items (li's) whose text doesn't begin with E, you could do it with this:


    doc.search("li").collect!{|node|
        node if not /^E/.match(node.inner_text)
    }.compact.remove

which is the same as saying:

  • Look for every list item in the document
  • Take the results of that search (which is an Array of Hpricot Elements) and apply the collect! function to them
  • collect! executes the code in the block for each element and stores the return value in an array
  • But as the block returns nil for the items we want to keep (those whose inner_text begins with 'E' and hence matches our little regular expression), we remove the nil values from the array with compact, so that we don't get errors when removing.
  • And finally, remove the elements which are in the resulting array, with the classical Hpricot remove

Note how I used collect! instead of just collect, so that the changes are applied to the search results themselves instead of producing a new array.

You should try using collect instead of collect!, and removing compact from the chain, to see what happens.
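
If you don't have Hpricot at hand, the collect and compact part can be tried on a plain Ruby array. This is only a rough sketch: the strings below stand in for the inner_text of each list item in the test page, and the variable names (texts, candidates, to_remove) are made up for the illustration. It shows where the nils come from and what compact leaves behind.

```ruby
# Plain strings stand in for the inner_text of each <li> in the test page.
# No Hpricot is involved; this only illustrates collect and compact.
texts = [
  "Element 1",
  "This will be removed because the text doesn't begin with an E",
  "Element 1 in the list with id=second_list",
  "element 2"
]

# Same block as above: keep the value when it does NOT begin with "E",
# otherwise the block evaluates to nil.
candidates = texts.collect { |text| text if not /^E/.match(text) }
p candidates  # nil sits where the text begins with "E"

# compact drops the nils, leaving only the items that would be removed.
to_remove = candidates.compact
p to_remove

# On a plain array, reject expresses the same filter in a single step.
same_result = texts.reject { |text| text =~ /^E/ }
p same_result == to_remove  # => true
```

Note that "element 2" ends up in the removal list too, since the regular expression is case sensitive. The reject variant is shown for plain arrays only; I haven't checked whether its result on Hpricot search results still responds to remove.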

Final result

If one applies all these evil removals to the original page, the final result is this:


        <p>This is a paragraph without attributes</p>

    <ul>
        <li>Element 1</li>

    </ul>

Pretty empty, isn't it?!

Download these examples

I've uploaded hpricot_remove_elements.rb and test.html together in a zip file: hpricot_remove_elements.zip. To run it, just unpack it and type ruby hpricot_remove_elements.rb

Or open it with TextMate and press Option+R ;-)