UTF-8 checklist

Following the discussion in the previous post (Reasons for using UTF-8), I thought it would be interesting to gather the steps needed to get a UTF-8-friendly environment.

I'm going to focus on PHP and MySQL, because combining MySQL, Ruby on Rails and UTF-8 tends to be somewhat easier (especially since newer Rails versions automagically tell MySQL which charset to use when connecting), but most of the advice applies to both platforms anyway.

In your preferred editor

Make sure your editor is set to use UTF-8, especially when editing templates and any other files used for building output content. If you save non-ASCII content (for example, accented words) in another encoding and it ends up mixed with UTF-8 content (from other templates or sources), you'll get garbled characters in the output.
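To see why the editor setting matters, here is a minimal sketch in PHP, assuming the mbstring extension is available. It compares the bytes the same word produces when saved as ISO-8859-1 and as UTF-8; the word and the variable names are only for illustration.

```php
<?php
// "é" saved by an editor set to ISO-8859-1 is the single byte 0xE9.
$latin1 = "caf\xE9";
// The same word saved as UTF-8 needs two bytes for "é": 0xC3 0xA9.
$utf8 = "caf\xC3\xA9";

var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Mixing both in one page produces output that is not valid UTF-8.
$page = $utf8 . ' ' . $latin1;
var_dump(mb_check_encoding($page, 'UTF-8'));   // bool(false)

// Converting the ISO-8859-1 bytes yields the UTF-8 form.
$fixed = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($fixed === $utf8);                    // bool(true)
```

A browser told to expect UTF-8 will show a replacement character or garbage where the stray 0xE9 byte is.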

In your HTML/XML code

Make sure the document's charset is specified.

In HTML documents you accomplish this with the Content-Type meta tag, which you should place inside the head element:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

In XML documents this is done in the XML declaration, which must be placed at the very beginning of the document:

<?xml version="1.0" encoding="UTF-8"?>

I have heard recommendations to place the page title right after the Content-Type declaration, so that browsers can switch to the right charset before they reach a title that includes non-ASCII characters, but it sounds a bit like an urban myth to me.

I have also seen people recommend specifying the encoding in every form on your site, but I haven't noticed any difference either way.

In Apache

Make sure the content is being served as UTF-8. A good AddDefaultCharset utf-8 should do. You don't need to convert all your hosted content to UTF-8 if you don't want to: charsets can be configured per virtual host as well.

In the database

Make sure MySQL is configured to use UTF-8 at the server level.


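# Place after [mysqld] in /etc/mysql/my.cnf or wherever it is in Windows
# Note: init-connect is not run for users with the SUPER privilege
init-connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_ci
# default-character-set=utf8 is no longer accepted under [mysqld] since MySQL 5.5;
# on newer servers put it under [client] instead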

Or, if you can't modify my.cnf (shared servers, anyone?), run SET NAMES 'utf8'; right after you've connected to the server and before issuing any other query.

Otherwise, it seems MySQL won't properly recognize the character set that the client is using (!!) and will return bad data.

Of course, when you create the tables, make sure that every column which can hold text data (this includes TEXT, VARCHAR, etc.) is using utf8 as well.
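A minimal sketch of that step from PHP, assuming the mysqli extension. It uses mysqli's set_charset() rather than a literal SET NAMES query, since it also lets the client library know which charset is in use. The function name connect_utf8 and all connection values are hypothetical placeholders.

```php
<?php
/**
 * Open a MySQL connection and switch it to UTF-8 before any other query.
 * Host, credentials and database name are placeholders.
 */
function connect_utf8(string $host, string $user, string $pass, string $dbname): mysqli
{
    $db = new mysqli($host, $user, $pass, $dbname);
    if ($db->connect_errno) {
        throw new RuntimeException('Connection failed: ' . $db->connect_error);
    }
    // Has the same effect on the connection as SET NAMES 'utf8', and also
    // informs the client side (real_escape_string relies on it).
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }
    return $db;
}

// Usage (placeholder values):
// $db = connect_utf8('localhost', 'blog_user', 'secret', 'blog');
// $result = $db->query('SELECT title FROM posts');
```

If you'd rather stick to the literal statement, $db->query("SET NAMES 'utf8'") right after connecting does the job as well.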
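As a rough illustration, this is what a table definition with an explicit charset could look like when issued from PHP. The posts table and its columns are made up for the example; the CHARACTER SET and COLLATE clauses are the relevant part, and text columns inherit them unless they declare their own.

```php
<?php
// Hypothetical table; every text column inherits the table's default charset.
$createPosts = <<<SQL
CREATE TABLE posts (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    body TEXT NOT NULL
) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci
SQL;

// Run it on an already open connection, for example with mysqli:
// $db->query($createPosts);
```

For tables that already exist, SHOW CREATE TABLE posts; shows which charset each column is actually using.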

In PHP

I rarely need to send a header specifying the content type, but if things aren't quite working you could also try specifying it manually with the header function:

header('Content-type: text/html; charset=utf-8');

Or application/xml, whatever you need!
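Here is a small sketch of how the two variants could share one helper. The function name content_type_header is hypothetical; the only real API used is PHP's header() and headers_sent().

```php
<?php
// Build the header line for a given MIME type; the charset defaults to utf-8.
function content_type_header(string $mime = 'text/html', string $charset = 'utf-8'): string
{
    return "Content-Type: {$mime}; charset={$charset}";
}

// header() only has an effect before any output has been sent,
// so call it at the very top of the script.
if (!headers_sent()) {
    header(content_type_header('application/xml'));
}
```

Calling content_type_header() with no arguments gives the text/html line shown above.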

In Flash (or Flex)

Do not use System.useCodepage. Flash assumes XML is UTF-8, so if everything else has been set up properly, Flash will be served UTF-8 content, which is what it expects, and we'll be happy :-)

One more thing

Running a validator is generally a good idea, and it's an even better one if you're doing UTF-8 stuff and want to make sure you're not inadvertently outputting bad data. Validators are picky (XML validators even more so) and will cheekily reveal your inappropriate characters.
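If you want a quick local check before reaching for a full validator, something like the following sketch can help. It assumes PHP's DOM extension and only tests well-formedness, not validity against a schema; the function name is_well_formed_xml and the sample documents are made up.

```php
<?php
// Returns true if $xml parses as well-formed XML. A document declared as
// UTF-8 that contains bytes from another encoding fails to parse.
function is_well_formed_xml(string $xml): bool
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok !== false;
}

$good = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>caf\xC3\xA9</word>";
$bad  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>caf\xE9</word>";

var_dump(is_well_formed_xml($good)); // bool(true)
var_dump(is_well_formed_xml($bad));  // bool(false)
```

The second document contains an ISO-8859-1 "é" (the single byte 0xE9) inside a document that claims to be UTF-8, which is exactly the kind of thing a picky XML parser refuses.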

Missing something? Anything wrong?

I'm not an expert and I may easily have forgotten something, so if you think something is missing or plainly wrong, you know what to do ;-)

And before you ask - I don't have any experience with Pylons, Django, TurboGears, Zope or whatever your favourite Python framework is, so feel free to share your experience with us.