Converting a Movable Type blog from ISO-8859-1 to UTF-8

Let me tell you that moving an MT blog from one server to another is not necessarily a piece of cake, especially if you have the odd idea of switching the charset from ISO-8859-1 to UTF-8 (Unicode) along the way. It took me an awful lot of time and trial and error, and I'm documenting the process in the hope that it will save someone else some time.

The key thing to keep in mind is that switching from one charset to another, with existing content, is not a matter of changing one setting here or there. Modifying AddDefaultCharset in Apache or PublishCharset in MT does not mean you are set to go. The charset must be consistent all the way through, from the content to the receiving end. This means that the content itself has to be in UTF-8 (possibly converted from another charset), then stored in the database, manipulated by your blog software and served by the web server as UTF-8. This consistency can be hard to achieve, and any break in the chain is a source of trouble.
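As an illustration, the following shell sketch shows why a single mismatched link is enough to break things. It assumes a POSIX shell with iconv and od available, and the variable and helper names are made up for the example. The same word has different bytes in each charset, and UTF-8 bytes that are wrongly treated as ISO-8859-1 and converted again come out double-encoded (the classic "cafÃ©" effect).

```shell
#!/bin/sh
# The word "café" in UTF-8, written with octal escapes so that this script
# itself does not depend on the encoding of the file it is saved in.
word_utf8='caf\303\251'

# Hypothetical helper: print the bytes read on stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Same text, two charsets, different bytes.
utf8_hex=$(printf "$word_utf8" | hex)
latin1_hex=$(printf "$word_utf8" | iconv -f UTF-8 -t ISO-8859-1 | hex)
echo "UTF-8:          $utf8_hex"     # 636166c3a9 (the "é" takes two bytes)
echo "ISO-8859-1:     $latin1_hex"   # 636166e9   (the "é" takes one byte)

# The mismatch: UTF-8 bytes wrongly declared as ISO-8859-1, then converted.
double_hex=$(printf "$word_utf8" | iconv -f ISO-8859-1 -t UTF-8 | hex)
echo "double-encoded: $double_hex"   # 636166c383c2a9, displayed as "cafÃ©"
```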

First, I transferred all the static files from the source server to the destination one (this is where you may want to upgrade your copy of MT). Then I transferred the database using mysqldump, connecting from the new server directly to the source (you may not be able to do so, in which case you would have to dump on the source server and transfer the resulting file):

mysqldump --default-character-set=latin1 -C -u username -p -h host --opt --skip-add-locks --skip-extended-insert database > mtdump.sql

My source database uses ISO-8859-1. The trap here is to forget to override the charset, because old versions of the mysql client (like the one on my source server) have ISO-8859-1 as the default, while new ones (starting with 4.1, I think) use UTF-8.
Another trap may be to use --opt alone, which is supposed to be the best option (it's on by default in recent mysql versions). However, you may face two problems:
- you don't have permission to LOCK tables, hence the --skip-add-locks option
- if you cannot change the max_allowed_packet variable on your destination server, you may get the following error during the import: ERROR 1153 (08S01): Got a packet bigger than 'max_allowed_packet' bytes. Hence the --skip-extended-insert option, which produces a bigger file but with smaller INSERT statements.
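If you want a rough idea of whether a dump is likely to hit that limit, one option is to look at its longest line, since mysqldump writes each INSERT statement on a single line. This is only a sketch: longest_line is a hypothetical helper name, and the longest line is an approximation of the largest packet, not an exact figure.

```shell
#!/bin/sh
# Hypothetical helper: print the length in bytes of the longest line of a file.
# LC_ALL=C makes awk count bytes rather than characters.
longest_line() {
    LC_ALL=C awk '{ n = length($0); if (n > max) max = n } END { print max + 0 }' "$1"
}

# Usage, assuming the dump file name used in this post.
if [ -f mtdump.sql ]; then
    echo "longest statement: $(longest_line mtdump.sql) bytes"
fi
```

The figure can then be compared with the value that SHOW VARIABLES LIKE 'max_allowed_packet' reports on the destination server.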

Then I converted the SQL dump from ISO-8859-1 to UTF-8:

iconv -f iso-8859-15 -t utf-8 mtdump.sql > mtutf8.sql

[Note: I use ISO-8859-15 here, because of the euro (€) sign. See this tutorial on charsets for more information on the ISO-8859/latin1 family.]
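To see what the choice of source charset changes, here is a minimal sketch (assuming a POSIX shell with iconv and od, and an iconv build that knows both charsets; the helper and variable names are made up). The byte 0xA4 is the generic currency sign (¤) in ISO-8859-1 but the euro sign (€) in ISO-8859-15, so the same input byte ends up as a different UTF-8 character.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes read on stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Octal 244 is the single byte 0xA4.
as_latin1=$(printf '\244' | iconv -f ISO-8859-1  -t UTF-8 | hex)
as_latin9=$(printf '\244' | iconv -f ISO-8859-15 -t UTF-8 | hex)

echo "0xA4 read as ISO-8859-1:  $as_latin1"   # c2a4   = U+00A4, currency sign
echo "0xA4 read as ISO-8859-15: $as_latin9"   # e282ac = U+20AC, euro sign
```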

Then I used mysql, setting the proper input charset, to feed the destination database:

mysql -u username -p --default-character-set=utf8 database < mtutf8.sql
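Before running an import like the one above, it may be worth running a couple of sanity checks on the converted file. The sketch below is not something from my original procedure: the function names are hypothetical, and it assumes an iconv that exits with an error on invalid input (which is the usual behavior). The first check confirms that the file is valid UTF-8. The second counts lines that still mention latin1, for example in a SET NAMES or DEFAULT CHARSET clause written by a newer mysqldump, which you may want to review by hand.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: count the lines that still mention latin1.
count_latin1_mentions() {
    grep -ci 'latin1' "$1"
}

# Usage, assuming the converted dump file name used in this post.
if [ -f mtutf8.sql ]; then
    if is_valid_utf8 mtutf8.sql; then
        echo "mtutf8.sql is valid UTF-8"
    else
        echo "mtutf8.sql contains invalid UTF-8 sequences"
    fi
    echo "lines mentioning latin1: $(count_latin1_mentions mtutf8.sql)"
fi
```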

Once this is done, you'll need to configure your blogs with the proper settings at the destination (notably the path names), rebuild the templates that are linked to files and that use accented characters and, of course, rebuild the whole site.

Now the really unfunny traps...

First, supposing that you have a funny first name like François and the highly stupid idea of writing it as it should be written (with its bells and whistles, accents, cedillas, etc.) in an MT login name, you may have to resort to some trickery to log in to your new MT installation. Never use accents in a login name, especially with a product developed in the U.S.! (BTW, TypeKey is quite broken in this regard too, always trying to scramble my first name.)

Along the same lines, if like me your writings go beyond ASCII, and if you are using the dirify attribute to create category folders and file names, you will find that the dirify function is broken (apparently so since MT 3.1). It works with ISO-8859-1 but not with UTF-8, and it will turn all your accents into 'a'. This is quite problematic since UTF-8 is supposed to be the default charset since MT 3.0! To overcome this, you can either use the dirify for Unicode plugin and write dirify_unicode="1" wherever you would use dirify="1", or (this is what I did) copy the entire my %HighASCII = (...) hash table from sub convert_high_unicode in this plugin and use it to replace the one in lib/MT/Util.pm. I prefer the latter, since I hope that Six Apart will eventually fix the bug in dirify.
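For readers who have never looked at what dirify is expected to produce, here is a rough shell sketch of the idea, applied to UTF-8 input. It is neither MT's code nor the plugin's: dirify_sketch is a hypothetical name, the accent map is deliberately partial, and the steps (map accents, lowercase, drop other characters, turn spaces into underscores) are only an approximation of the real function.

```shell
#!/bin/sh
# Hypothetical sketch of a dirify-like transformation for UTF-8 text.
# Each accented character gets its own substitution: under LC_ALL=C sed works
# on bytes, so a bracket list of multi-byte characters would not be reliable.
dirify_sketch() {
    printf '%s\n' "$1" | LC_ALL=C sed \
        -e 's/à/a/g' -e 's/â/a/g' -e 's/ä/a/g' -e 's/À/a/g' \
        -e 's/ç/c/g' -e 's/Ç/c/g' \
        -e 's/é/e/g' -e 's/è/e/g' -e 's/ê/e/g' -e 's/ë/e/g' -e 's/É/e/g' \
        -e 's/î/i/g' -e 's/ï/i/g' \
        -e 's/ô/o/g' -e 's/ö/o/g' \
        -e 's/ù/u/g' -e 's/û/u/g' -e 's/ü/u/g' |
    LC_ALL=C tr 'A-Z' 'a-z' |
    LC_ALL=C sed -e 's/[^a-z0-9_ ]//g' -e 's/  */_/g'
}

dirify_sketch "François à l'école"
```

A complete table such as the plugin's %HighASCII hash covers far more characters than this partial map, and anything the map does not know about is simply dropped by the last step.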

That's all for tonight, I hope I didn't forget anything big. If you see this post, you've reached the new server. If you notice anything strange, please let me know!

P.S.: 1. I first considered using TypeMover, which makes the attractive promise of handling both the transfer from one MT installation to another and the charset conversion. But my attempt ended up with one major problem: it converts all the accents into HTML entities, which is a big no-no for me (would you like to edit content where half of the words are scrambled with ugly &blah; blocks?)

2. This tutorial assumes that you are moving your blog to the same major version of MT. If this is not the case, then you'll have to get your content, convert it to UTF-8 and follow the MT upgrade steps. The order is not that important.