Reasons for using UTF-8

The subject of encodings is quite confusing. At the beginning one never really knows what the differences between encodings are or, most importantly, what the consequences of choosing ISO-8859 instead of UTF-8 are. Now that I am starting to have more arguments than the "Trust me, I think this is the right decision" one, I want to share with you what I know. And of course, please correct me where I am wrong!

The main problem is the development platform, which happens to be Windows most of the time, and its default encoding. On Western systems that default is actually Windows-1252 (CP-1252), a close cousin of ISO-8859-1 that is often mistaken for it. Since the majority of web developers are in countries for which ISO-8859 is more than enough (Europe, North America, etc.), and that majority also tends to use Windows, their servers are set to use ISO-8859, the databases are created using ISO-8859, and the code, the templates and, by extension, the pages come out automatically in ISO-8859. (I have also noticed Eclipse defaulting to CP-1252, apparently on every platform, which keeps puzzling me!)
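Since ISO-8859-1 and CP-1252 are so easy to mix up, here is a small Python sketch of one place where they differ. It is only an illustration: the euro sign is just my choice of test character, and the first line prints whatever default encoding your own platform reports, which will vary from machine to machine.

```python
import locale

# The encoding this platform would pick by default for text.
# The value depends on the OS and locale, so no output is shown here.
print(locale.getpreferredencoding(False))

euro = "€"

# CP-1252 has a slot for the euro sign...
cp1252_bytes = euro.encode("cp1252")
print(cp1252_bytes)  # b'\x80'

# ...but ISO-8859-1 does not, so encoding fails outright.
try:
    euro.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False

# UTF-8 can represent it, like any other Unicode character.
utf8_bytes = euro.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x82\xac'
```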

It is fine if you never expect to have any characters outside ISO-8859 in your content, but that only happens in very specific cases, typically when you are the only person entering content. Most of the websites you build will probably allow people from all around the world to register and submit their content, and here's where the fun begins:

  1. Even if the site is in English, people's names are still in their own language. Let them enter their name with their own characters and don't force them to transliterate it into English. Obviously, the name is just an example. It could also be book and movie titles, music albums, etc.
  2. If you aggregate feeds from other sites, they will most probably come in UTF-8. If your site is not in UTF-8, you'll have to either use utf8_decode (in PHP), which turns every character that ISO-8859-1 cannot represent into a question mark, or convert that text into HTML entities.
  3. If you use Flash with dynamic content (which you generate), it will expect the content to arrive encoded in UTF-8. There's no way of changing that unless you mess around with the evil System.useCodepage setting (but that's a bad idea).
  4. If you use AJAX, you need to return UTF-8 content, since that is what XMLHttpRequest assumes unless it is told otherwise. Just like the Flash case.
  5. You may want to use the content of your non-UTF-8 website in other applications that do support UTF-8 but are not web based (for example, a reports system). If you used the HTML entities trick to store such content in your database, you'll have to convert the entities back into UTF-8 or something like it (and fingers crossed!).
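To make points 2 and 5 a bit more tangible, here is a rough Python sketch of both workarounds. The name is a made-up example, the `replace` error handler is only an approximation of what PHP's utf8_decode does with characters it cannot represent, and the numeric references stand in for the HTML entities trick.

```python
import html

name = "山田太郎"  # hypothetical user-submitted name

# Workaround 1: squeeze the text into ISO-8859-1.
# Every character the target encoding lacks becomes a question mark.
lossy = name.encode("iso-8859-1", errors="replace")
print(lossy)  # b'????'

# Workaround 2: store numeric character references instead of the characters.
entities = name.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(entities)

# Any consumer that is not a browser has to undo the trick first.
restored = html.unescape(entities)
print(restored == name)  # True

# With UTF-8 there is nothing to undo: the bytes round-trip as they are.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes.decode("utf-8") == name)  # True
```

The first workaround destroys the data for good, and the second keeps it only as long as every reader of the database remembers to decode the entities.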

Considering all the above situations, it's easy to see that it's better to use UTF-8 right from the beginning.

In that case:

  1. People can register using their real name. Japanese people (for example) will be happy.
  2. Aggregate everything you want and don't worry about external feeds having characters that your page encoding doesn't include.
  3. Flash will be happy. Now you just need to make sure to embed all the characters you may need, but that's another story.
  4. AJAX will be happy too.
  5. Generate reports without having to mess with HTML entities. What you query is what you need.

... and best of all, as long as your system is properly set up, you don't need to do anything special about UTF-8 in your code. You just need to think about the content, and you can stop worrying about utf8_encode, utf8_decode, htmlentities and all that mess!
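What "properly set up" means depends on your stack, but on the output side it mostly comes down to encoding the page as UTF-8 and saying so. Here is a minimal Python sketch of that idea; `build_page` is a hypothetical helper of mine, not part of any framework, and a real setup would also need the database and its connection configured for UTF-8, which is not shown here.

```python
import html


def build_page(user_text):
    """Return (headers, body_bytes) for a small HTML page encoded as UTF-8."""
    page = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Demo</title></head>\n'
        "<body><p>" + html.escape(user_text) + "</p></body></html>\n"
    )
    # Encode once, at the very edge of the application.
    body = page.encode("utf-8")
    headers = {
        # Declare the encoding in the HTTP header as well as in the markup.
        "Content-Type": "text/html; charset=utf-8",
        # Content-Length counts bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_page("山田太郎 & Zoë")
print(headers["Content-Type"])  # text/html; charset=utf-8
```

Note that the only escaping left is the usual one for markup characters such as `&`; the Japanese and accented characters go through untouched.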

Did I convince you to use UTF-8?