Web archaeology

Sun Nov 08 2009 23:38:50 GMT+0000 (Greenwich Mean Time)

If you ask me whether anything fascinates me as much as computers do, the answer is definitely archaeology. Being able to go somewhere, dig up a few layers of soil and then draw up a list of facts about the habits of whoever used to inhabit that place seems almost magical to me. If you put both topics together, it simply becomes mind-boggling, especially taking into account the fragility of computer data -- a mischievous magnet placed in the wrong place and data is not what you expected it to be anymore.

So the concept of preserving electronic data forever, or at least for quite a long time --say, a century--, is something that has always interested me. How do we make sure that whichever string of 0s and 1s we write down today is still available in a hundred years' time, if we can barely cope with keeping backup copies of our important data?

Replicate, replicate, replicate...

It seems the best solution is replication. Since digital media is so fragile, we prevent potential losses by duplicating data again and again, and then a couple more times, just in case. I find it awesome that I can still play old cassette tapes and vinyl records from when I was a child, and they sound pretty much the same as they did then. They do look old and sound noisy, but it's not because they have degraded. It's because they are old (musically and visually speaking) and nowadays we are used to the pristine audio quality of digital media: high quality streamed audio files (e.g. OGG, MP3 at 320 kbps, even FLACs, 32-bit WAVs, you name it). In the visual department, H264, Blu-rays and company. Even DVDs look clumsily blurry if you play them side by side with Blu-ray content.

Now go and try to read any CD that you recorded five or six years ago. I'll have to congratulate you if you are successful, because I find it quite a hard task. Last time I checked, pretty much every data CD of mine from six years ago had one or more read errors, even the ones from good brands like LG or Sony rather than unbranded discs. The digital generation is an utter, total failure at preserving its memories.

It is therefore really surprising to find tons of ancient sites out there on the Internet, way older than those unreadable CDs, but still perfectly readable with a simple browser, as Mark Pilgrim pointed out a couple of days ago in what could easily be described as Applied Web Archaeology.

Web 1.0

One of the sites that was still up until recently was Geocities, a realm of long-forgotten Internet jewels and embryonic blogs-to-be that Yahoo! shut down on 26th October. For a long time, it actually was my home on the web. Did you know that if you wanted an account there, you had to wander endlessly through your preferred neighbourhood until you found an empty slot to call home? Try this one... nah, this one is not empty. Try next block... ah maybe here... oh but I don't really like the number it's got. Let's try another block...

All that to finally be able to upload a whopping 2 megabytes of data to a server. It was awesome! I remember talking about it excitedly with the most Internet-aware girl I knew. Two megs! You can't imagine how much stuff could fit into that ridiculous amount! And oh, how much fun it was to do so!

It was also insanely baroque. In fact, everything was crazy delicious nonsense. Rules? There weren't any. I loved looking at the source code of websites I liked, and whenever I found a nice trick, I would use it as soon as possible. That's how I found out that one could specify not only the colour of text, but also the font! Oh, the font tag. And blink, and marquees. And what about tables inside tables to center stuff, blockquotes to give content a little padding to the left, and a couple more nested blockquotes if you still needed more padding?

A bit later I found out that if you specified the margin and padding attributes in the body tag, you would get rid of all that stupid space the browser wasted. We are talking about a time when 640x480 resolutions were the norm and 800x600 a bit of a luxury. 1024x768 was something we could only dream of, or, if we were very lucky, use with only 256 colours and a headache-inducing refresh rate. So saving those 10-15 pixels a browser wasted on every page gave you plenty of room for adding yet more animated GIFs of under-construction workers; "send me an e-mail" icons --because at that point we couldn't even imagine anyone would spam our precious, brand new and empty inbox, and we loved to get and send e-mail to everyone, just because--; animated separation bars, which were way more fun than the default and boring hr elements (the horizontal rule, just in case you are that young); psychedelic tiling backgrounds; and of course, the visit counter, which I consider the very precursor of Google Analytics!

At some point, my website looked more or less like this:

WELCOME!

You're visitor number counter

since feb'1998

under construction

Send me an e-mail!

No, I haven't recovered the files from a backup. Amazingly, these files can still be found on animated GIF sites.

You quickly realize that the Internet --as we know it today-- began there and then, not in the boring academic, gray-background websites. And just as mass media fell in love with weblogs and later abandoned them in favor of web 2.0 stories about entrepreneurs and rockstar ninja coders, the same people who had signed up for their own personal ~~website~~ homepage abandoned it as well, letting it stagnate, untouched and in the very state it was in the last time anyone dared to manually update those hand-crafted HTML files. Maybe they just moved to Blogspot, LiveJournal, WordPress or even their own self-hosted blog.

Whatever they did is not important. What matters is that those sites were left there, snapshots of a time eons past, sitting on those servers for years, largely untouched and forgotten. Visiting a Geocities-hosted website was pretty much like going on a tour around Pompeii, only without the hordes of picture-taking tourists. Well, you could take screenshots if you wanted to, but that's not the point. The thing is, it was all there, we took that for granted, and then Yahoo! found out there was a huge amount of data they were maintaining and apparently no one really cared about. So since they couldn't monetize it and they also needed to cut costs, it had to be shut down. Respecting web history? That's not for tough guys like Yahoo! -- they are in this for the money, after all.

Reocities - RIP Geocities

Luckily, a team of talented, nostalgic archivists took up a self-imposed task: to back up the entire Geocities site before Yahoo! pulled the plug, and to make the backup available online again.

Yes, there is the Wayback Machine from Archive.org, but it sometimes doesn't cache that well. These guys aimed for a really accurate copy. They haven't finished processing all the data they gathered yet, and the numbers are already massive: 1,993,539 accounts and 30,486,871 archives are online right now, with lots more batches to be imported. Check it out for yourself: Reocities. Maybe your old site is still there, if you still remember its URL. Mine was :-)

And here's the making of:

The ingredients:
* 1 iconic website about to be erased
* 21 pots of strong tea
* more sugar than is probably healthy
* very little sleep
* some computing gear
* one solid Internet connection
* 6 days in October 2009
* Some very good help (Thanks Abi!)
I love it!

But why? Nobody cared about it anyway...

It's not only about the romanticism of being able to indulge in 1990s web design aesthetics, when CSS was a badly implemented feature that differed greatly between IE and other browsers (wait! the latter still happens, doesn't it?) and JavaScript was something that could be used to add some funky effects to homepages, although most of the time the scripts only worked with Navigator because it supported some things that IE didn't (wait again! what is the status of the audio and video tags?). It's also about the large amounts of information that thousands of people placed there because we had that Utopian idea of the Internet as a virtual cyberlibrary to which we could all contribute whatever we knew.

All of that happened well before Wikipedia tried to become that very library, and well before Google came up with their perverse PageRank algorithm and tried to swallow the whole Internet in their data centers too, in that strange era when we only learned about new pages when somebody recommended them to us. It was like retweeting, only the bandwidth was really narrow. If you really liked something, you made the effort of telling your friends, or linking to it from your homepage, with its own unusual design which made it a truly personal homepage, and not Yet Another Template-Based website.

Can this happen again? And... will anyone care about it?

Oh, of course it can -- how dare you doubt it? We are hopeless at backing things up! But there's a team of people dedicated to looking around websites, analyzing whether they are at risk of disappearing, and stepping in where services treat their dormant customers with contempt. They call themselves The Archive Team, and I found their Deathwatch page an amazingly detailed --if not scarily disturbing-- prediction of what might happen, or account of what has already happened, to websites like Geocities.

The problem is, nowadays it's so easy to dump one's thoughts on the net that --quite ironically-- most of the "content" out there is plainly worthless and devoid of any actual content. You just go to whatever social website you fancy, sign up, and in a couple of minutes you can be connecting with your friends, sharing your life with them and all that blabbering, but you're in fact generating tatty stuff that probably not even you will want to read again in two days' time, let alone remember you posted.

Maybe I'm being too harsh --after all, who am I to decide what is and what isn't worth keeping?

However, I think it's that very harshness that makes me appreciate even more the effort of people like Archive Team, who are working hard to make sure that even services like TinyURL are properly backed up. So when a given URL shortener goes out of business, we'll still be able to make sense of the tweets which irresponsibly used its links in order to save a few characters. Maybe Mark Pilgrim will use their archived tweets in 2026 to document why we have the canvas4d element!

Long live Archive Team and Reocities!!!