soledad penadés

UTF-8 checklist

Following the discussion in the previous post (Reasons for using UTF-8) I thought it could be interesting to gather a series of steps needed to get a UTF-8 friendly environment.

I'm going to focus on PHP and MySQL, because using MySQL with Ruby/Rails and UTF-8 tends to be somewhat easier (especially since newer Rails versions automatically tell MySQL which charset to use when connecting), but the advice can be applied to both platforms in any case.

In your preferred editor

Make sure your editor is set to use UTF-8, especially when editing templates and any other files used for building output content. If a file saved in another encoding includes non-ASCII content (for example, accented words) and its output is mixed with UTF-8 content (from other templates or sources), things will get messed up.
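To see why mixed encodings hurt, here is a small Python sketch (not part of the PHP setup itself). The two "template fragments" and their contents are made up for illustration: one is saved as UTF-8, the other as Latin-1, and the page is just their bytes glued together.

```python
# Hypothetical template fragments: one saved as UTF-8, one saved as Latin-1.
header_utf8 = "Menú del día".encode("utf-8")
footer_latin1 = "Dirección".encode("latin-1")

# The final page is a byte-level concatenation of both fragments.
page = header_utf8 + b" | " + footer_latin1

# Read as UTF-8, the Latin-1 fragment contains an invalid byte, which
# becomes U+FFFD (the replacement character).
as_utf8 = page.decode("utf-8", errors="replace")

# Read as Latin-1, the UTF-8 fragment turns into the usual two-character garbage.
as_latin1 = page.decode("latin-1")

print("\ufffd" in as_utf8)              # True: the footer is damaged
print(as_latin1.startswith("Men\u00c3"))  # True: the header is damaged
```

Whichever charset the browser is told to use, one of the two fragments comes out broken, so the only real fix is saving everything in the same encoding.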

In your html/xml code

Make sure the document's charset is specified.

In HTML documents you would accomplish this with the Content-Type meta tag, which you should place in the document's head:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

In XML documents this is done in the XML declaration, which needs to be placed at the very beginning of the document:

<?xml version="1.0" encoding="UTF-8"?>
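If you want to check the "very beginning" part for yourself, a quick sketch with Python's standard xml.etree.ElementTree parser can help. The sample document and the parses() helper are my own illustration, and other parsers may word their errors differently.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'
BODY = "<title>Soledad Penadés</title>"  # made-up sample document

def parses(data: bytes) -> bool:
    """Return True if the standard library parser accepts the document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

# Declaration at the very start of the document: accepted.
good = (DECLARATION + BODY).encode("utf-8")

# A stray newline before the declaration: rejected.
bad = ("\n" + DECLARATION + BODY).encode("utf-8")

print(parses(good))  # True
print(parses(bad))   # False
```

A blank line or stray whitespace emitted before the declaration (for example by an included file) is enough to trigger the second case.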

I have heard recommendations for placing the page title immediately after the Content-Type has been declared, so that browsers can switch to the right charset immediately if the page title includes any non-ASCII characters, but it sounds a bit like an urban myth to me.

I have also seen people recommend specifying the encoding in every form on your site, but I haven't found any difference between doing it or not.

In Apache

Make sure the content is being served as UTF-8. A good AddDefaultCharset utf-8 should do. You don't need to change all your hosted content to UTF-8 if you don't want to; charsets can be configured per virtual host as well.

In the database

Make sure mysql is configured to use UTF-8 at server level.

Before issuing any other query, and right after you've connected to the server, execute

SET NAMES 'utf8';

Otherwise, it seems, MySQL will not properly recognize the character set that the client is using (!!) and will return bad data.
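What "bad data" tends to look like can be simulated without a database. This is only a sketch of the mismatch, not MySQL's actual code: I'm assuming the connection charset is left at latin1 (MySQL's long-time default), and Python's "latin-1" codec stands in for it.

```python
text = "Penadés"
client_bytes = text.encode("utf-8")   # what a UTF-8 client actually sends

# Mismatch: the server believes the bytes are latin1 and converts them
# to UTF-8 for storage in a utf8 column, encoding them a second time.
misread = client_bytes.decode("latin-1")
stored_wrong = misread.encode("utf-8")

# Match (the situation SET NAMES 'utf8' sets up): the bytes are taken as
# UTF-8 and stored unchanged.
stored_right = client_bytes.decode("utf-8").encode("utf-8")

print(len(client_bytes), len(stored_wrong), len(stored_right))  # 8 10 8
print(stored_right == client_bytes)                             # True
```

The double-encoded version is two bytes longer because the single accented letter has been stored as two separate characters.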

Of course, when you create the tables, make sure that every column which can include text data (this includes TEXT, VARCHAR, etc) is using utf8 as well.

In php

I rarely need to send a header specifying the content type, but if things are not working quite right you could also try to specify it manually with the header function:

header('Content-type: text/html; charset=utf-8');

Or application/xml, whatever you need!

In Flash (or Flex)

Do not use systemCodepage. Flash assumes XML is UTF-8, so if everything else has been properly set up, Flash will be served UTF-8 content, which is what it expects, and we'll be happy :-)

One more thing

Although it's generally a good idea to run a validator, it's an even better idea if you're doing UTF-8 stuff and want to make sure you're not outputting bad stuff inadvertently. Validators are picky (even more so if we speak about XML validators) and will cheekily reveal your inappropriate characters.
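A full validator does much more, but the UTF-8 part of the check is easy to sketch yourself. The function name below is made up; it just reports where the first bad byte sits, so you know where to look in your output.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

clean = b"caf\xc3\xa9"    # "café" encoded as UTF-8
dirty = b"caf\xe9 ok"     # "café ok" where the accent is a Latin-1 byte

print(first_invalid_utf8_offset(clean))  # None
print(first_invalid_utf8_offset(dirty))  # 3
```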

Missing something? Anything wrong?

I'm not an expert and I may easily have forgotten something, so if you think something is missing or plainly wrong, you know what to do ;-)

And before you ask - I don't have any experience with Pylons, Django, TurboGears, Zope or whatever your favourite Python framework is, so feel free to share your experience with us.

// 4 responses to UTF-8 checklist

Nahuel
20071211

I've worked with Django for about 6 months now, and I've seen it get decent Unicode support.
First make sure you're using the SVN version, and also make sure you declare your source file encoding as http://www.python.org/dev/peps/pep-0263/ says.
Even though Django prints objects correctly with the __str__ method (something like Java's toString), don't forget to implement __unicode__ for every model class.
Also keep in mind that current Python strings get the system encoding, so I don't feel very confident when I write something like "%s %s" % (var1, var2). I tend to write u'%s %s' % (var1, var2).
If you are using some extra Python package like reportlab for PDF output, make sure you use unicode(instance), or u'something %s' % obj.
That's all I have to do to get proper Unicode handling. It's easy, though I miss Java a little in this particular subject.

winden
20071211

Important steps if you are coding in C using POSIX:

1. Add setlocale(LC_CTYPE,""); at the start of your main function.

2. Use char where you are managing UTF-8, and internally pack/unpack to wchar_t, which is a flat 32-bit-wide character on most POSIX systems, when doing internal operations. CPU+cache is fast and memory is slow, so take advantage and keep your strings packed even while in memory.

3. A literal string with UTF-8 encoding:

char *s = "whatever";

4. A literal string with UTF-32 encoding:

wchar_t *s = "whatever";
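The pack/unpack idea from step 2 can be sketched in Python as well, with UTF-8 bytes playing the role of the packed char strings and UTF-32 playing the role of a 32-bit wchar_t. The sample string is made up, and this is an analogy rather than a translation of the C code.

```python
# Packed form: UTF-8 bytes, variable width, compact in memory.
packed = "辞書 dictionary".encode("utf-8")

# Unpacked form: fixed width, 4 bytes per character (no byte order mark
# because the byte order is named explicitly).
wide = packed.decode("utf-8").encode("utf-32-le")

print(len(packed), len(wide))  # 17 52

# With fixed-width characters, the n-th character is a simple slice.
n = 1
nth = wide[4 * n:4 * n + 4].decode("utf-32-le")
print(nth == "書")  # True
```

The packed form is about a third of the size here, which is the memory argument being made, while the wide form gives constant-time access to each character.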

sole
20071212

My understanding was that if you just use chars when dealing with UTF-8 characters, things can easily break - for example when uppercasing a string.
Although I haven't checked the wide character string functions (I don't use C for handling strings ;-)) …
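A rough illustration of that worry, in Python rather than C: bytes.upper() only touches ASCII letters, much like calling toupper on each char in the default "C" locale would, so I'm using it here as a stand-in. The word is just an example.

```python
word = "café"
raw = word.encode("utf-8")   # b'caf\xc3\xa9': 5 bytes for 4 characters

# Byte-wise uppercasing leaves the accented letter untouched.
byte_upper = raw.upper().decode("utf-8")

# Character-aware uppercasing handles it.
text_upper = word.upper()

print(byte_upper == text_upper)  # False: "CAFé" versus "CAFÉ"

# Cutting by byte count can split a character in half.
try:
    raw[:4].decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False
print(cut_is_valid)  # False
```

So per-byte operations are fine as long as they only care about ASCII, and start to go wrong as soon as they need to understand a whole character.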

winden
20071214

My app was a Japanese+English dictionary and stored the lines as UTF-8 in memory.

Searching a dictionary is nothing more complex than doing a lot of substring compares using strstr, and that's safe to do between two utf8 strings due to the binary encoding.
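The reason this is safe is that UTF-8 continuation bytes never look like the start of a character, so a valid needle cannot match in the middle of one. A small Python sketch of the same byte-level search, with bytes.find standing in for strstr and a made-up dictionary line:

```python
# Hypothetical dictionary line and search term, both stored as UTF-8 bytes.
haystack = "日本語 (にほんご): Japanese language".encode("utf-8")
needle = "にほんご".encode("utf-8")

# Plain byte search, with no knowledge of characters at all.
offset = haystack.find(needle)
print(offset)  # 11

# The match starts on a character boundary, so everything before it
# still decodes cleanly.
prefix = haystack[:offset].decode("utf-8")
print(len(prefix))  # 5
```

The byte offset (11) and the character offset (5) differ, which only matters if you need to report positions in characters.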

All other per-char stuff was done with wide chars, which is easy due to their fixed width.

bugfix for 4th point above:

wchar_t *s = L"whatever";
