<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>soledad penadés &#187; hpricot</title>
	<atom:link href="http://soledadpenades.com/tag/hpricot/feed/" rel="self" type="application/rss+xml" />
	<link>http://soledadpenades.com</link>
	<description>repeat 4[fd 100 rt 90]</description>
	<lastBuildDate>Wed, 25 Apr 2012 21:10:33 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>A first impression on Ruby&#8217;s Mechanize</title>
		<link>http://soledadpenades.com/2012/03/17/a-first-impression-on-rubys-mechanize/</link>
		<comments>http://soledadpenades.com/2012/03/17/a-first-impression-on-rubys-mechanize/#comments</comments>
		<pubDate>Sat, 17 Mar 2012 11:47:29 +0000</pubDate>
		<dc:creator>sole</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[hpricot]]></category>
		<category><![CDATA[irb]]></category>
		<category><![CDATA[mechanize]]></category>
		<category><![CDATA[nokogiri]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[screen scrapping]]></category>

		<guid isPermaLink="false">http://soledadpenades.com/?p=3927</guid>
		<description><![CDATA[TL;DR: If you need to do screen scraping, use Ruby's Mechanize.]]></description>
			<content:encoded><![CDATA[<p>I had to do some screen scraping yesterday. I have previously used <a href="https://github.com/hpricot/hpricot">hpricot</a> for this kind of thing (or maybe even just plain <a href="http://curl.haxx.se/">cURL</a> or similar), but this time the page required me to be logged in, and not through any &#8220;standard&#8221; HTTP authentication that cURL could deal with, but through a custom, HTML-based CMS login form. So I understood that I needed something more powerful; something that could pass as a &#8220;proper&#8221; browser and not just a simple crawler.</p>
<p>I had heard of quite a few successful experiences with Mechanize, so I decided to give it a try, and it vastly exceeded what I expected from it! :-)</p>
<p><a href="http://mechanize.rubyforge.org/">Mechanize</a> is a port of the original <a href="http://search.cpan.org/dist/WWW-Mechanize/">Perl Mechanize</a> library; there is a <a href="http://wwwsearch.sourceforge.net/mechanize/">Python port</a> too, and there might be ports for other languages, but I wasn&#8217;t interested in those.</p>
<p>For some reason my mind was in a ruby mood yesterday, so I decided to go for the ruby port. Everything was flowing nicely between ruby and me. I even dared to prototype what I wanted to do in a terminal, using <a href="http://en.wikipedia.org/wiki/Interactive_Ruby_Shell">irb</a>, before writing the final script. My ruby muscle was getting toned again, so to speak.</p>
<p>I had some false starts, though. I first invoked just &#8216;ruby&#8217; in the terminal, only to be greeted with a blank, waiting prompt: nothing happened because (I guess) ruby was expecting me to provide it with something to run. Then I remembered that the interactive ruby executable is called irb. A Control+C and four other keystrokes later, I was in immediate-mode ruby.</p>
<p>Another gotcha: the following line was not needed when running in irb, but it was needed when writing a script:</p>
<div class="syhi_block"><code><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'rubygems'</span></code></div>
<p>Thankfully I remembered that and was able to fix it quickly.</p>
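<p>A small sketch of a guard for that gotcha. It assumes that only Ruby 1.8 needs the explicit require, while Ruby 1.9 and later load RubyGems on their own, in which case the first line does nothing:</p>

```ruby
# On Ruby 1.8, gems only become visible once RubyGems itself is loaded.
# Ruby 1.9 and later load it at startup, so this require is then skipped.
require 'rubygems' unless defined?(Gem)

puts "RubyGems #{Gem::VERSION} is loaded"
```

<p>After that line, requiring a gem such as mechanize should behave the same in a script as it does in irb.</p>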
<p>But I&#8217;m digressing. Back to Mechanize: it simulates a real browser interacting with a website. You can even fake the user agent. The most basic screen scrapers simply download a page, do something with it, then download another one and do something else, without keeping any sort of continuity between downloads and connections. Mechanize, in contrast, is stateful: it keeps state (such as cookies) between &#8216;visited&#8217; pages. To all intents and purposes, the visited website sees a normal browser at the other end of the line. Unless you go crazy and begin hammering a server with tons of requests, crawling websites this way might be almost invisible to servers.</p>
<p>Which is exactly what I wanted!</p>
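<p>To make &#8220;stateful&#8221; a bit more concrete, here is a minimal sketch of the kind of bookkeeping that a stateful agent takes care of between requests. It is not Mechanize&#8217;s actual code: <code>TinyCookieJar</code> and its methods are hypothetical names made up for illustration, and real cookie handling also has to deal with domains, paths and expiry dates.</p>

```ruby
# Hypothetical, heavily simplified stand-in for the state a stateful agent
# keeps between requests. Not Mechanize's implementation.
class TinyCookieJar
  def initialize
    @cookies = {}
  end

  # Remember the "name=value" part of each Set-Cookie response header,
  # ignoring attributes such as Path or HttpOnly.
  def store(set_cookie_headers)
    Array(set_cookie_headers).each do |header|
      name, value = header.split(';').first.split('=', 2)
      @cookies[name.strip] = value.to_s.strip
    end
  end

  # Build the Cookie header to send along with the next request.
  def header
    @cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end

jar = TinyCookieJar.new
# Pretend the response to the login form contained this header:
jar.store('session=abc123; Path=/; HttpOnly')
# Every later request would then carry the session along:
puts jar.header
```

<p>A basic scraper that forgets this header between downloads would look logged out again on the very next page; a stateful agent sends it back for you on every request.</p>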
<p>The syntax is quite nice and intuitive. Borrowing shamelessly from the manual:</p>
<div class="syhi_block"><code><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'rubygems'</span><br />
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'mechanize'</span><br />
<br />
agent = Mechanize.<span style="color:#9900CC;">new</span><br />
page = agent.<span style="color:#9900CC;">get</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'http://google.com/'</span><span style="color:#006600; font-weight:bold;">&#41;</span><br />
<br />
<span style="color:#008000; font-style:italic;"># List links on the page</span><br />
page.<span style="color:#9900CC;">links</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>link<span style="color:#006600; font-weight:bold;">|</span><br />
&nbsp; <span style="color:#CC0066; font-weight:bold;">puts</span> link.<span style="color:#9900CC;">text</span>, link.<span style="color:#9900CC;">href</span><br />
<span style="color:#9966CC; font-weight:bold;">end</span><br />
<br />
<span style="color:#008000; font-style:italic;"># Click on the first link with text 'News'</span><br />
page = agent.<span style="color:#9900CC;">page</span>.<span style="color:#9900CC;">link_with</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#ff3333; font-weight:bold;">:text</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">'News'</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">click</span><br />
<br />
<span style="color:#008000; font-style:italic;"># Do a Google search, using the search form</span><br />
google_form = page.<span style="color:#9900CC;">form</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'f'</span><span style="color:#006600; font-weight:bold;">&#41;</span><br />
google_form.<span style="color:#9900CC;">q</span> = <span style="color:#996600;">'ruby mechanize'</span><br />
page = agent.<span style="color:#9900CC;">submit</span><span style="color:#006600; font-weight:bold;">&#40;</span>google_form, google_form.<span style="color:#9900CC;">buttons</span>.<span style="color:#9900CC;">first</span><span style="color:#006600; font-weight:bold;">&#41;</span></code></div>
<p>Something that made me even happier is that Mechanize uses <a href="http://nokogiri.org/">Nokogiri</a> internally for parsing HTML, which means we can do the same style of nice DOM tree traversal that I used to do with Hpricot, only even better! (Nokogiri is the successor to Hpricot.)</p>
<p>I didn&#8217;t need that last feature for this particular case, but it left me wondering what else I could use Mechanize for, just so that I get to use it! I will have a look at my TODO list and see where I can use my newly acquired Mechanize skills!</p>
			]]></content:encoded>
			<wfw:commentRss>http://soledadpenades.com/2012/03/17/a-first-impression-on-rubys-mechanize/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	</item>
		<item>
		<title>How to install hpricot in Ubuntu 8.04</title>
		<link>http://soledadpenades.com/2008/10/24/how-to-install-hpricot-in-ubuntu-84/</link>
		<comments>http://soledadpenades.com/2008/10/24/how-to-install-hpricot-in-ubuntu-84/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 09:56:59 +0000</pubDate>
		<dc:creator>sole</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[hpricot]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[ubuntu]]></category>

		<guid isPermaLink="false">http://soledadpenades.com/?p=778</guid>
		<description><![CDATA[This could be considered a fresh installation, speaking in ruby terms. I just had ruby installed, no ruby gems, nor ruby dev nor anything else ruby. So this should be enough for installing hpricot as well as ruby gems (which are required for installing hpricot). As you can see, I didn&#8217;t download any source file, [...]]]></description>
			<content:encoded><![CDATA[<p>This could be considered a fresh installation, speaking in ruby terms: I had just ruby installed, with no RubyGems, no ruby dev package, nor anything else ruby. So these steps should be enough for installing hpricot as well as RubyGems (which is required for installing hpricot).</p>
<p>As you can see, I didn&#8217;t download any source files; instead I was happy using apt-get and the packages from the Ubuntu repositories, although they are relatively old (for example, rubygems is more than a year old). If I find any problems and need to update to newer versions, I&#8217;ll report that here ;-)</p>
<div class="syhi_block"><code><span style="color: #c20cb9; font-weight: bold;">sudo</span> <span style="color: #c20cb9; font-weight: bold;">apt-get</span> <span style="color: #c20cb9; font-weight: bold;">install</span> rubygems<br />
<span style="color: #c20cb9; font-weight: bold;">sudo</span> <span style="color: #c20cb9; font-weight: bold;">rm</span> <span style="color: #000000; font-weight: bold;">/</span>var<span style="color: #000000; font-weight: bold;">/</span>lib<span style="color: #000000; font-weight: bold;">/</span>gems<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">1.8</span><span style="color: #000000; font-weight: bold;">/</span>source_cache<br />
<span style="color: #c20cb9; font-weight: bold;">sudo</span> gem update<br />
<span style="color: #c20cb9; font-weight: bold;">sudo</span> <span style="color: #c20cb9; font-weight: bold;">apt-get</span> <span style="color: #c20cb9; font-weight: bold;">install</span> ruby1.8-dev<br />
<span style="color: #c20cb9; font-weight: bold;">sudo</span> gem <span style="color: #c20cb9; font-weight: bold;">install</span> hpricot</code></div>
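<p>To check from ruby whether the installation worked, something like the following sketch could be used. <code>gem_installed?</code> is a hypothetical helper name, and it assumes a RubyGems recent enough to provide <code>Gem::Specification.find_all_by_name</code>, which an old packaged version may well lack.</p>

```ruby
require 'rubygems'

# Hypothetical helper: true if at least one version of the named gem
# is installed, false otherwise.
def gem_installed?(name)
  !Gem::Specification.find_all_by_name(name).empty?
end

puts gem_installed?('hpricot') ? 'hpricot is ready' : 'hpricot is missing'
```

<p>If it reports the gem as missing, re-running the last command from the list above is the first thing I would try.</p>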
<p>It&#8217;s a pity there isn&#8217;t a metapackage for ruby&#8217;s development files (the ruby1.8-dev package), the same way there&#8217;s a <strong>ruby</strong> metapackage which depends on the <strong>ruby1.8</strong> package. With one, whenever ruby is updated the development files would be updated to the matching version as well, without the user having to worry about the version number.</p>
<p>What&#8217;s more, I instinctively tried a <em>naive</em> <strong>sudo apt-get install rubydev</strong> and was greeted with a sad <em>&#8220;Couldn&#8217;t find package rubydev&#8221;</em>. It somehow proves that a metapackage called rubydev would be quite useful&#8230; at least for instinctive users.</p>
<p>Enjoy your screen scraping!</p>
			]]></content:encoded>
			<wfw:commentRss>http://soledadpenades.com/2008/10/24/how-to-install-hpricot-in-ubuntu-84/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	</item>
		<item>
		<title>Parsing a del.icio.us export with Hpricot</title>
		<link>http://soledadpenades.com/2008/03/25/parsing-a-delicious-export-with-hpricot/</link>
		<comments>http://soledadpenades.com/2008/03/25/parsing-a-delicious-export-with-hpricot/#comments</comments>
		<pubDate>Tue, 25 Mar 2008 08:54:11 +0000</pubDate>
		<dc:creator>sole</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[bookmarks]]></category>
		<category><![CDATA[data scrapping]]></category>
		<category><![CDATA[delicious]]></category>
		<category><![CDATA[hpricot]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://soledadpenades.com/2008/03/25/parsing-a-delicious-export-with-hpricot/</guid>
		<description><![CDATA[The trickiest part is to detect if a bookmark has a corresponding description. The export is in the same format that Netscape used for its bookmarks export, which means it is a simple html file with a definition list (dl) and a series of definition terms (dt). A term (=bookmarks) may have a description (dd). [...]]]></description>
			<content:encoded><![CDATA[<p>The trickiest part is detecting whether a bookmark has a corresponding description. The export is in the same format that Netscape used for its bookmarks export, which means it is a simple HTML file with a definition list (<strong>dl</strong>) and a series of definition terms (<strong>dt</strong>). A term (i.e. a bookmark) may be followed by a description (<strong>dd</strong>).</p>
<p>But how do you detect if there&#8217;s a description? It turned out the answer was rather simple: use <strong>term.next</strong>, and if the <em>next</em> element&#8217;s name is <em>dd</em>, we&#8217;re lucky and have a description. The only problem was that I didn&#8217;t know how to access the name of an element, until I just thought: what if I simply use <em>name</em>? And guess what&#8230; it worked! So term.next.name was exactly what I was looking for :-)</p>
<div class="syhi_block"><code><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'rubygems'</span><br />
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'hpricot'</span><br />
<br />
doc = <span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;bookmarks.html&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span> Hpricot<span style="color:#006600; font-weight:bold;">&#40;</span>f<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#125;</span><br />
<br />
bookmarks = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span><br />
<br />
<span style="color:#006600; font-weight:bold;">&#40;</span>doc<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;dl/dt&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>term<span style="color:#006600; font-weight:bold;">|</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; link = <span style="color:#006600; font-weight:bold;">&#40;</span>term<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;a&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> term.<span style="color:#9966CC; font-weight:bold;">next</span> <span style="color:#9966CC; font-weight:bold;">and</span> term.<span style="color:#9966CC; font-weight:bold;">next</span>.<span style="color:#9900CC;">name</span> == <span style="color:#996600;">'dd'</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; desc = term.<span style="color:#9966CC; font-weight:bold;">next</span>.<span style="color:#9900CC;">inner_text</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; desc = <span style="color:#0000FF; font-weight:bold;">nil</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> link.<span style="color:#9900CC;">attr</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'tags'</span><span style="color:#006600; font-weight:bold;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; tags = link.<span style="color:#9900CC;">attr</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'tags'</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#CC0066; font-weight:bold;">split</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;,&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; tags = <span style="color:#0000FF; font-weight:bold;">nil</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; bookmarks <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> <span style="color:#006600; font-weight:bold;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#ff3333; font-weight:bold;">:address</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">=&gt;</span>&nbsp; &nbsp; &nbsp; link.<span style="color:#9900CC;">attr</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'href'</span><span style="color:#006600; font-weight:bold;">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#ff3333; font-weight:bold;">:created_at</span> &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">=&gt;</span>&nbsp; &nbsp; &nbsp; link.<span style="color:#9900CC;">attr</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'last_visit'</span><span style="color:#006600; font-weight:bold;">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#ff3333; font-weight:bold;">:tags</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">=&gt;</span>&nbsp; &nbsp; &nbsp; tags,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#ff3333; font-weight:bold;">:description</span>&nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">=&gt;</span>&nbsp; &nbsp; &nbsp; desc,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#ff3333; font-weight:bold;">:title</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">=&gt;</span>&nbsp; &nbsp; &nbsp; link.<span style="color:#9900CC;">inner_text</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <br />
<span style="color:#9966CC; font-weight:bold;">end</span></code></div>
<p><a href="http://github.com/sole/snippets/blob/master/web/scrapping/delicious_dump_parse/extract.rb">Source</a> at supersnippets.</p>
<p>I also extended this a bit to save the results into a database, using ActiveRecord, but since <em>each db schema is a different world</em>, I didn&#8217;t post that version here. If anybody thinks it might be useful just let me know.</p>
<p>Also, this code is not very <em>rubyesque</em> yet; suggestions to improve it will be really appreciated. I&#8217;m especially thinking about the <em>if &#8230; else</em> parts: I&#8217;m pretty sure there&#8217;s a way to shorten those lines :-)</p>
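<p>For what it&#8217;s worth, here is one possible way those branches could be shortened. It is only a sketch: <code>description_for</code> and <code>split_tags</code> are hypothetical helper names, and they rely on a trailing <em>if</em> evaluating to nil when its condition fails, so no explicit <em>else</em> is needed.</p>

```ruby
# Hypothetical helpers. `term` is anything that responds to `next`, and its
# sibling to `name` and `inner_text`, as the elements in the script above do.
def description_for(term)
  sibling = term.next
  # A modifier-if yields nil when the condition is false.
  sibling.inner_text if sibling and sibling.name == 'dd'
end

# nil in, nil out; otherwise an array of tags.
def split_tags(raw)
  raw.split(',') if raw
end

# Inside the loop, the two if/else blocks would then become:
#   desc = description_for(term)
#   tags = split_tags(link.attr('tags'))
puts split_tags('ruby,hpricot').inspect
puts split_tags(nil).inspect
```

<p>Whether the helper methods are worth it for a short script is a matter of taste; the same two one-liners could also be written inline in the loop.</p>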
			]]></content:encoded>
			<wfw:commentRss>http://soledadpenades.com/2008/03/25/parsing-a-delicious-export-with-hpricot/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	</item>
		<item>
		<title>Removing elements with Hpricot</title>
		<link>http://soledadpenades.com/2007/10/05/removing-elements-with-hpricot/</link>
		<comments>http://soledadpenades.com/2007/10/05/removing-elements-with-hpricot/#comments</comments>
		<pubDate>Fri, 05 Oct 2007 10:08:53 +0000</pubDate>
		<dc:creator>sole</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[hpricot]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://www.soledadpenades.com/2007/10/05/removing-elements-with-hpricot/</guid>
		<description><![CDATA[Something like a month ago, a guy asked me how to remove elements with Hpricot. I told him I would look into it but it&#8217;s been a month already! So I hope I can compensate for the delay with this minitutorial on removing stuff with Hpricot! :-) First I created a simple test page. It&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Something like a month ago, <a href="http://xbelanch.wordpress.com/">a guy</a> <a href="http://www.soledadpenades.com/2007/06/15/extracting-data-with-hpricot/#comment-44218">asked me</a> how to remove elements with Hpricot. I told him I would look into it, but it&#8217;s been a month already! So I hope I can compensate for the delay with this minitutorial on removing stuff with Hpricot! :-)</p>
<p>First I created a simple test page. It has some HTML elements: some have ids, and some contain certain text nodes. It looks like this:</p>
<div class="syhi_block"><code><span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/p.html"><span style="color: #000000; font-weight: bold;">p</span></a>&gt;</span>This is a paragraph without attributes<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/p.html"><span style="color: #000000; font-weight: bold;">p</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/p.html"><span style="color: #000000; font-weight: bold;">p</span></a> <span style="color: #000066;">id</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;bad_attribute&quot;</span>&gt;</span>This is a paragraph with one attribute: id=bad_attribute<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/p.html"><span style="color: #000000; font-weight: bold;">p</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/ul.html"><span style="color: #000000; font-weight: bold;">ul</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span>Element 1<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span>This will be removed because the text doesn't begin with an E<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/ul.html"><span style="color: #000000; font-weight: bold;">ul</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/ul.html"><span style="color: #000000; font-weight: bold;">ul</span></a> <span style="color: #000066;">id</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;second_list&quot;</span> <span style="color: #000066;">style</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;border:1px solid red;&quot;</span>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span>Element 1 in the list with id=second_list<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span>element 2<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/ul.html"><span style="color: #000000; font-weight: bold;">ul</span></a>&gt;</span></code></div>
<p>The question was how to remove certain individual elements given certain conditions &#8211; more specifically, when the element attributes matched a condition. I don&#8217;t see why he had problems removing stuff with the <strong>remove</strong> method, since that&#8217;s what I have used. Since <strong>search</strong> returns a collection of elements, you just need to get a collection which contains only the element you want to remove, and then apply <strong>remove</strong> to that collection.</p>
<p>Here are three examples:</p>
<h3>Removing the paragraph with id = bad_attribute</h3>
<p>We find the element with a CSS selector, where the hash (#) means &#8216;id&#8217;.</p>
<div class="syhi_block"><code>doc.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;p#bad_attribute&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">remove</span></code></div>
<h3>Removing all the unordered lists (ul&#8217;s) which have a style attribute</h3>
<p>Again, using CSS selectors:</p>
<div class="syhi_block"><code>doc.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;ul[@style]&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">remove</span></code></div>
<p>There&#8217;s more info about CSS selectors in the <a href="http://code.whytheluckystiff.net/hpricot/wiki/HpricotCssSearch">Hpricot CSS search documentation</a>. One can get very creative with them, and they allow for filtering almost everything!</p>
<h3>Removing elements whose contents match certain conditions</h3>
<p>When CSS selectors aren&#8217;t enough, we can take full advantage of Ruby!</p>
<p>For example, if you want to remove list items (li&#8217;s) whose text doesn&#8217;t begin with E, you could do it with this:</p>
<div class="syhi_block"><code>doc.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;li&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">collect</span>!<span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">|</span>node<span style="color:#006600; font-weight:bold;">|</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; node <span style="color:#9966CC; font-weight:bold;">if</span> <span style="color:#9966CC; font-weight:bold;">not</span> <span style="color:#006600; font-weight:bold;">/</span>^E<span style="color:#006600; font-weight:bold;">/</span>.<span style="color:#9900CC;">match</span><span style="color:#006600; font-weight:bold;">&#40;</span>node.<span style="color:#9900CC;">inner_text</span><span style="color:#006600; font-weight:bold;">&#41;</span><br />
<span style="color:#006600; font-weight:bold;">&#125;</span>.<span style="color:#9900CC;">compact</span>.<span style="color:#9900CC;">remove</span></code></div>
<p>which is the same as saying:</p>
<ul>
<li>Look for every list item in the document</li>
<li>Take the results of that search (which is an Array of Hpricot Elements) and apply the <a href="http://www.ruby-doc.org/core/classes/Array.html#M002211">collect!</a> function to them</li>
<li><strong>collect!</strong> executes the code in the block for each element and stores the return value in an array</li>
<li>But as the block can return nils (when the inner_text begins with &#8216;E&#8217; and hence matches our little regular expression, which means the element must be kept), we remove the nil values from the array with <a href="http://www.ruby-doc.org/core/classes/Array.html#M002239">compact</a>, so that we don&#8217;t get errors when removing.</li>
<li>And finally, remove the elements which are in the resulting array, with the classical Hpricot remove</li>
</ul>
<p>Note how I used collect! instead of just collect, so that the changes are applied over the search results, and we don&#8217;t get a new array instead.</p>
<p>You should try using <strong>collect</strong> instead of <strong>collect!</strong>, and removing <strong>compact</strong> from the chain, to see what happens.</p>
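<p>If you don&#8217;t have Hpricot at hand, the roles of collect, collect! and compact can still be seen on a plain Ruby Array. This is only a sketch under an assumption: the strings below stand in for the inner_text of the four list items in the test page, while the real search result is an Hpricot collection and not a plain Array.</p>

```ruby
# Plain strings standing in for the inner_text of each <li> in the test page.
texts = [
  "Element 1",
  "This will be removed because the text doesn't begin with an E",
  "Element 1 in the list with id=second_list",
  "element 2"
]

# collect returns a NEW array and leaves texts untouched. The block yields
# nil for every text that begins with E, because of the "unless".
marked = texts.collect { |text| text unless /^E/.match(text) }

# compact drops the nils, leaving only the entries we want to remove.
to_remove = marked.compact

# collect! overwrites the receiver in place instead of building a copy.
copy = texts.dup
copy.collect! { |text| text unless /^E/.match(text) }

p marked
p to_remove
p copy == marked
```

<p>The entries left in to_remove are the ones that the Hpricot version would hand over to remove.</p>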
<h3>Final result</h3>
<p>If one applies all these evil removals to the original code, the final result is this:</p>
<div class="syhi_block"><code><span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/p.html"><span style="color: #000000; font-weight: bold;">p</span></a>&gt;</span>This is a paragraph without attributes<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/p.html"><span style="color: #000000; font-weight: bold;">p</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/ul.html"><span style="color: #000000; font-weight: bold;">ul</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span>Element 1<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/li.html"><span style="color: #000000; font-weight: bold;">li</span></a>&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><a href="http://december.com/html/4/element/ul.html"><span style="color: #000000; font-weight: bold;">ul</span></a>&gt;</span></code></div>
<p>Pretty empty, isn&#8217;t it?!</p>
<h3>Download these examples</h3>
<p>I&#8217;ve uploaded hpricot_remove_elements.rb and test.html together in a zip file: <a href="/files/hpricot/hpricot_remove_elements.zip">hpricot_remove_elements.zip</a>. To run it, just unpack it and type ruby hpricot_remove_elements.rb</p>
<p>Or open with textmate and press Option+R ;-)</p>
]]></content:encoded>
			<wfw:commentRss>http://soledadpenades.com/2007/10/05/removing-elements-with-hpricot/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		<atom:link rel="payment" href="https://flattr.com/submit/auto?user_id=8399&amp;amp;url=http%3A%2F%2Fsoledadpenades.com%2F2007%2F10%2F05%2Fremoving-elements-with-hpricot%2F&amp;amp;language=en_GB&amp;amp;category=text&amp;amp;title=Removing+elements+with+Hpricot&amp;amp;description=Something+like+a+month+ago%2C+a+guy+asked+me+how+to+remove+elements+with+Hpricot.+I+told+him+I+would+look+into+it+but+it%27s+been+a+month+already%21+So+I+hope+I+can+compensate+for+the+delay+with+this+minitutorial+on+removing+stuff+with+Hpricot%21+%3A-%29%0D%0A%0D%0AFirst+I+created+a+simple+test+page.+It%27s+got+some+html+elements%2C+some+have+id%27s%2C+some+contain+certain+text+nodes.+It+looks+like+this%3A%0D%0A%0D%0A%0D%0A%09This+is+a+paragraph+without+attributes%0D%0A%09This+is+a+paragraph+with+one+attribute%3A+id%3Dbad_attribute%0D%0A%09%0D%0A%09%09Element+1%0D%0A%09%09This+will+be+removed+because+the+text+doesn%27t+begin+with+an+E%0D%0A%09%0D%0A%09%0D%0A%09%09Element+1+in+the+list+with+id%3Dsecond_list%0D%0A%09%09element+2%0D%0A%09%0D%0A%0D%0A%0D%0AThe+question+was+how+to+remove+certain+individual+elements+given+certain+conditions+-+more+specifically%2C+when+the+element+attributes+matched+a+condition.+I+don%27t+see+why+he+had+problems+removing+stuff+with+the+remove+method%2C+since+that%27s+what+I+have+used.+Since+search+returns+a+collection+of+elements%2C+you+just+need+to+get+a+collection+which+contains+only+the+element+you+want+to+remove%2C+and+then+apply+remove+to+that+collection.%0D%0A%0D%0AHere+are+three+examples%3A%0D%0A%0D%0ARemoving+the+paragraph+with+id+%3D+bad_attribute%0D%0AWe+find+out+the+element+using+CSS+selectors%2C+where+the+hash+means+%27id%27.%0D%0Adoc.search%28%22p%23bad_attribute%22%29.remove%0D%0A%0D%0ARemoving+all+the+unordered+lists+%28ul%27s%29+which+have+an+style+attribute%0D%0AAgain%2C+using+CSS+selectors%3A%0D%0A%0D%0Adoc.search%28%22ul%5B%40style%5D%22%29.remove%0D%0A%0D%0A%0D%0AThere%27s+more+info+about+CSS+selectors+in+the+Hpricot+CSS+search+documentation
.+One+can+get+very+creative+with+this+and+allows+for+filtering+almost+everything%21%0D%0A%0D%0ARemoving+elements+whose+contents+match+certain+conditions%0D%0AWhen+it%27s+not+enough+with+CSS+selectors%2C+we+can+perfectly+take+advantage+of+ruby%21%0D%0A%0D%0AFor+example%2C+if+you+want+to+remove+list+items+%28li%27s%29+whose+text+doesn%27t+begin+with+E%2C+you+could+do+it+with+this%3A%0D%0A%0D%0Adoc.search%28%22li%22%29.collect%21%7B%7Cnode%7C%0D%0A%09node+if+not+%2F%5EE%2F.match%28node.inner_text%29%0D%0A%7D.compact.remove%0D%0A%0D%0A%0D%0Awhich+is+the+same+as+saying%3A%0D%0A%0D%0ALook+for+every+list+item+in+the+document%0D%0ATake+the+results+of+that+search+%28which+is+an+Array+of+Hpricot+Elements%29+and+apply+the+collect%21+function+to+them%0D%0Acollect%21+executes+the+code+in+the+block+for+each+element+and+stores+the+return+value+in+an+array%0D%0ABut+as+it+can+return+nils+%28when+the+inner_text+doesn%27t+begin+with+%27E%27+and+hence+doesn%27t+match+our+little+regular+expression%29%2C+we+remove+nil+values+from+the+array+with+compact%2C+so+that+we+don%27t+get+errors+when+removing.%0D%0AAnd+finally%2C+remove+the+elements+which+are+in+the+resulting+array%2C+with+the+classical+Hpricot+remove%0D%0A%0D%0A%0D%0ANote+how+I+used+collect%21+instead+of+just+collect%2C+so+that+the+changes+are+applied+over+the+search+results%2C+and+we+don%27t+get+a+new+array+instead.%0D%0A%0D%0AYou+should+try+using+collect+instead+of+collect%21%2C+and+removing+compact+from+the+chain%2C+to+see+what+happens.%0D%0A%0D%0AFinal+result%0D%0AIf+one+applies+all+these+evil+removals+to+the+original+code%2C+the+final+result+is+this%3A%0D%0A%0D%0A%09%09This+is+a+paragraph+without+attributes%0D%0A%09%09%0D%0A%09%0D%0A%09%09Element+1%0D%0A%09%09%0D%0A%09%0D%0A%0D%0A%0D%0APretty+empty%2C+isn%27t+it%3F%21%0D%0A%0D%0ADownload+these+examples%0D%0AI%27ve+uploaded+the+hpricot_remove_elements.rb+and+test.html+together+in+a+zip+file%3A+hpricot_remove_elements.zip.+For+running+it%2C+just+unpack%2C+and+type+ruby+hpricot_
remove_elements.rb%0D%0A%0D%0AOr+open+with+textmate+and+press+Option%2BR+%3B-%29&amp;amp;tags=hpricot%2Cruby%2Cblog" type="text/html" />
	</item>
		<item>
		<title>Extracting data with Hpricot</title>
		<link>http://soledadpenades.com/2007/06/15/extracting-data-with-hpricot/</link>
		<comments>http://soledadpenades.com/2007/06/15/extracting-data-with-hpricot/#comments</comments>
		<pubDate>Thu, 14 Jun 2007 23:01:17 +0000</pubDate>
		<dc:creator>sole</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[data scrapping]]></category>
		<category><![CDATA[exif]]></category>
		<category><![CDATA[firebug]]></category>
		<category><![CDATA[hpricot]]></category>
		<category><![CDATA[jquery]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[xpath]]></category>

		<guid isPermaLink="false">http://www.soledadpenades.com/2007/06/15/extracting-data-with-hpricot/</guid>
		<description><![CDATA[For those (few) of you which haven&#8217;t heard about it, Hpricot is a nice library for parsing HTML in ruby, created by the even nicer _whytheluckystiff, author of Poignant&#8217;s Guide to Ruby, Camping and other ruby gems (may you excuse the pun? it was impossible to avoid it). Since I saw one demonstration by Rob [...]]]></description>
			<content:encoded><![CDATA[<p>For those (few) of you which haven&#8217;t heard about it, <a href="http://code.whytheluckystiff.net/hpricot/">Hpricot</a> is a nice library for parsing HTML in ruby, created by the even nicer <a href="http://whytheluckystiff.net/">_whytheluckystiff</a>, author of <a href="http://poignantguide.net/ruby/">Poignant&#8217;s Guide to Ruby</a>, <a href="http://code.whytheluckystiff.net/camping/">Camping</a> and other <em>ruby gems</em> (may you excuse the pun? it was impossible to avoid it).</p>
<p>Since I saw one demonstration by Rob McKinnon at <a href="http://www.soledadpenades.com/2007/03/13/london-ruby-users-group-brings-you-back-to-uni/">certain LRUG meeting</a>, I have been willing to try Hpricot, but I hadn&#8217;t seen an application for it yet. No more! I found myself today wanting to extract data from a table in a web page and suddenly I thought: <q>this is a job for Hpricot!</q>. More specifically, I wanted to extract <a href="http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html">these EXIF tags</a>, and I simply couldn&#8217;t accept the mere thinking of entering that data manually. It needed to be automated!</p>
<h3>Getting it</h3>
<p><strong>Getting Hpricot</strong> is very easy:</p>
<div class="syhi_block"><code><span style="color: #c20cb9; font-weight: bold;">sudo</span> gem <span style="color: #c20cb9; font-weight: bold;">install</span> hpricot</code></div>
<p>(If you&#8217;re picky, you can try the more exotic ways of installing it described on its homepage.) If you&#8217;re on Windows, drop the sudo, of course:</p>
<div class="syhi_block"><code>gem <span style="color: #c20cb9; font-weight: bold;">install</span> hpricot</code></div>
<p><strong>Understanding it</strong> is easy as well, especially if you have used <a href="http://jquery.com/">jQuery</a> before. It&#8217;s all about writing selectors to look for things, so it helps a lot if the HTML document is well marked up. Otherwise, you might end up writing lots of workarounds or extra code that could have been avoided simply by having a class or id on the relevant elements.</p>
<h3>Inspecting &amp; traversing</h3>
<p>So, once I got the library installed, I took a look at the page source code with <a href="http://www.getfirebug.com/">Firebug</a>. It is especially useful for this kind of job because it helps you <strong>visualize the hierarchy of elements in the page</strong>, including classes and id&#8217;s, so you don&#8217;t have to traverse the HTML tree manually to gather the data you need.</p>
<p>What I was looking for was the table which contained the relevant data. In this case we&#8217;re lucky: even though the table hasn&#8217;t got an id attribute, which would make it uniquely identifiable in the whole document, it still has class=&quot;inner&quot;, which happens to be used only once in the page, thus effectively acting as an element identifier.</p>
<p><img src="/imgs/firebug.png" alt="Firebug in action!" /></p>
<p>Note how Firebug is showing the tree path for the selected table. If we didn&#8217;t have the class attribute, we would need to use a selector like &#8220;/html/body/blockquote/table/tbody/tr/td/table&#8221;, but thanks to it, something as simple as &#8220;table.inner&#8221; is enough.</p>
<h3>Hands on Ruby</h3>
<p>Ok, so this is where we write a few lines of code which do a lot ;-)</p>
<p>First come the usual series of requires:</p>
<div class="syhi_block"><code><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'rubygems'</span><br />
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'hpricot'</span><br />
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'open-uri'</span></code></div>
<p><strong>Rubygems</strong> is required in order to load <strong>hpricot</strong>, and <strong>open-uri</strong> is required in order to directly read data from a URI. open-uri comes with ruby, so we don&#8217;t need to install anything else.</p>
<p>Now we need to get the HTML file. It is as simple as</p>
<div class="syhi_block"><code>doc = Hpricot<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#41;</span></code></div>
<p>but since I was doing lots of tests and didn&#8217;t want to overload that guy&#8217;s server, I simply saved the document as EXIF.html and loaded it with this instead:</p>
<div class="syhi_block"><code>doc = <span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;EXIF.html&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span> Hpricot<span style="color:#006600; font-weight:bold;">&#40;</span>f<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#125;</span></code></div>
<p>At this point we have the HTML document in the doc variable, so what are we waiting for?<br />
We initialize a rows variable for holding the data that we&#8217;ll extract:</p>
<div class="syhi_block"><code>rows = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span></code></div>
<p>And now comes the real fun!</p>
<div class="syhi_block"><code><span style="color:#006600; font-weight:bold;">&#40;</span>doc<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;table.inner//tr&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>row<span style="color:#006600; font-weight:bold;">|</span><br />
&nbsp; &nbsp; cells = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span><br />
&nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">&#40;</span>row<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;td&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>cell<span style="color:#006600; font-weight:bold;">|</span><br />
&nbsp; &nbsp; &nbsp; &nbsp;<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> <span style="color:#006600; font-weight:bold;">&#40;</span>cell<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot; span.s&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">length</span> <span style="color:#006600; font-weight:bold;">&gt;</span> <span style="color:#006666;">0</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; values = <span style="color:#006600; font-weight:bold;">&#40;</span>cell<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;span.s&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">inner_html</span>.<span style="color:#CC0066; font-weight:bold;">split</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'&lt;br /&gt;'</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">collect</span><span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>str<span style="color:#006600; font-weight:bold;">|</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; pair = str.<span style="color:#9900CC;">strip</span>.<span style="color:#CC0066; font-weight:bold;">split</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'='</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">collect</span><span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">|</span>val<span style="color:#006600; font-weight:bold;">|</span> val.<span style="color:#9900CC;">strip</span><span style="color:#006600; font-weight:bold;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#CC00FF; font-weight:bold;">Hash</span><span style="color:#006600; font-weight:bold;">&#91;</span>pair<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span>, pair<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span><span style="color:#006600; font-weight:bold;">&#40;</span>values.<span style="color:#9900CC;">length</span>==<span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; cells <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> cell.<span style="color:#9900CC;">inner_text</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; cells <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> values<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; cells <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> cell.<span style="color:#9900CC;">inner_text</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span><br />
&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span><br />
&nbsp; &nbsp; rows <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> cells<br />
&nbsp; &nbsp; <br />
<span style="color:#9966CC; font-weight:bold;">end</span></code></div>
<p>Ok, not that fast. I&#8217;ll elaborate a little more on the juicy bits.</p>
<div class="syhi_block"><code><span style="color:#006600; font-weight:bold;">&#40;</span>doc<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;table.inner//tr&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>row<span style="color:#006600; font-weight:bold;">|</span></code></div>
<p>This is the key to reaching the main data. It&#8217;s like saying <q>I&#8217;m looking in doc for all the rows (the tr&#8217;s) which are contained in a table whose class equals &#8216;inner&#8217;</q>. Inside a selector, a / means we want an immediate child, while // means a descendant at any depth below the element. As I said before, it&#8217;s all about selecting and traversing the tree.</p>
<p>With that line of code, each tr is handed to us in the <strong>row</strong> variable. We can continue extracting data from within <strong>row</strong>, and that&#8217;s exactly what we do with:</p>
<div class="syhi_block"><code><span style="color:#006600; font-weight:bold;">&#40;</span>row<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;td&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>cell<span style="color:#006600; font-weight:bold;">|</span></code></div>
<p>That one provides us with all the td elements inside the current row.</p>
<p>When we reach the td elements, all that is left is to extract the data for each cell and push it into the cells array, which will be pushed into the rows array. But we don&#8217;t just copy the cell data as it is; some cells contain notes, and some of those notes contain lists of values. I think we can all agree that those lists of values are what we commonly call Hashes, and they undoubtedly deserve special treatment!</p>
<div class="syhi_block"><code><span style="color:#9966CC; font-weight:bold;">if</span> <span style="color:#006600; font-weight:bold;">&#40;</span>cell<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot; span.s&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">length</span> <span style="color:#006600; font-weight:bold;">&gt;</span> <span style="color:#006666;">0</span></code></div>
<p>So that&#8217;s why I&#8217;m checking for the existence of a span with class == s inside each cell. If we find one, there&#8217;s a note in this row, and there&#8217;s probably a hash with values in it. I would say this is the most fun part of all:</p>
<div class="syhi_block"><code>values = <span style="color:#006600; font-weight:bold;">&#40;</span>cell<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;span.s&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">inner_html</span>.<span style="color:#CC0066; font-weight:bold;">split</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'&lt;br /&gt;'</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">collect</span><span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>str<span style="color:#006600; font-weight:bold;">|</span> <br />
&nbsp; pair = str.<span style="color:#9900CC;">strip</span>.<span style="color:#CC0066; font-weight:bold;">split</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'='</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">collect</span><span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">|</span>val<span style="color:#006600; font-weight:bold;">|</span> val.<span style="color:#9900CC;">strip</span><span style="color:#006600; font-weight:bold;">&#125;</span><br />
&nbsp; <span style="color:#CC00FF; font-weight:bold;">Hash</span><span style="color:#006600; font-weight:bold;">&#91;</span>pair<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span>, pair<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#93;</span><br />
<span style="color:#006600; font-weight:bold;">&#125;</span></code></div>
<p>I&#8217;m making use of the fact that each invoked method returns another object, so that I can chain the calls instead of doing a series of assignments. And it reads like this: <q>Take the html inside the span with class s, split it where you find a <strong>br</strong>, and for each of those split parts remove the surrounding whitespace and split it again where you find a <strong>=</strong>, so we get a key and a value, remove the whitespace around those as well and put them in a new Hash</q>.</p>
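<p>The chain can be tried on its own, without Hpricot, by feeding it a string. A minimal sketch: note_html is a made-up sample in the same &#8220;key = value&#8221; shape, and the final merge into a single Hash is an alternative I&#8217;m adding for illustration, not something the script does.</p>

```ruby
# A made-up note in the "key = value<br />key = value" shape described above;
# the values are illustrative only.
note_html = "1 = Horizontal (normal)<br />2 = Mirror horizontal<br />3 = Rotate 180"

# Same chain as in the post: split on the line break, split on "=",
# strip the whitespace, and wrap every pair in its own one-entry Hash.
values = note_html.split('<br />').collect { |str|
  pair = str.strip.split('=').collect { |val| val.strip }
  Hash[pair[0], pair[1]]
}

# Alternative (not what the post does): merge everything into one Hash,
# which is handier for lookups by key.
lookup = values.inject({}) { |all, one| all.merge(one) }

p values
p lookup["3"]
```

<p>Keeping one small Hash per pair preserves the order of the note even on old Ruby versions, while the merged Hash is more comfortable when you only want to look a value up.</p>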
<p>In the end we have an array of rows and cells, where certain cells occasionally contain a Hash with the constants used by the row&#8217;s EXIF tag.</p>
<p>It&#8217;s also interesting to note that the first row is unusable, because it corresponds to the th elements, so we&#8217;ll simply do a</p>
<div class="syhi_block"><code>rows.<span style="color:#9900CC;">shift</span></code></div>
<p>and it&#8217;s gone. To top it all off, we could output the <strong>rows</strong> array to a YAML file, so that we don&#8217;t need to run this each time we need the list of EXIF tags.</p>
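<p>Arrays in Ruby have a lovely method called <strong>to_yaml</strong>, which dutifully generates a version of the array in YAML syntax. It comes from the standard yaml library, so add a require 'yaml' if it isn&#8217;t loaded already. And it&#8217;s very easy to output that to a file:</p>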
<div class="syhi_block"><code><span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'hexif.yaml'</span>, <span style="color:#996600;">'w'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span><br />
&nbsp; f <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> rows.<span style="color:#9900CC;">to_yaml</span><br />
<span style="color:#006600; font-weight:bold;">&#125;</span></code></div>
<p>And you&#8217;re done! I hope you liked this small Hpricot tutorial/introduction&#8230; and if you have any suggestions or improvements, please let me know!</p>
<p>Of course, you can get the complete source code here: <a href="http://github.com/sole/snippets/blob/master/web/scrapping/hexif/hexif.rb">hexif.rb</a>. It is a ridiculously short 61 lines, including some comments and blank lines. <strong>Come on, get it and do something cool!</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://soledadpenades.com/2007/06/15/extracting-data-with-hpricot/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
	</item>
	</channel>
</rss>

