Debugging charset encoding mismatch with Apache

While setting up a new weblog with UTF-8 as the default charset, I spent literally hours trying to figure out why my first name kept showing up as FranÃ§ois instead of François. Not that I'm not used to it already, but I have this foolish hope that computers will eventually make our lives easier.

It turned out that despite a correct definition of the charset encoding in all pages (<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />), some pages (output from CGI scripts) would be recognized as carrying the proper encoding while others (HTML, PHP) were always reported as having an ISO-8859-1 charset.
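The garbled name is the typical signature of UTF-8 bytes being decoded as ISO-8859-1. As a minimal sketch (assuming the page really is stored as UTF-8 and the browser has been told to read it as ISO-8859-1), the effect can be reproduced in a few lines of Python:

```python
# Assumption: the page bytes are UTF-8, but the reader decodes them as ISO-8859-1.
name = "François"

# In UTF-8, 'ç' is encoded as two bytes: 0xC3 0xA7.
utf8_bytes = name.encode("utf-8")

# ISO-8859-1 maps every byte to exactly one character, so the two bytes
# of 'ç' come out as two separate characters.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # FranÃ§ois

# Reversing the wrong decoding recovers the original text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # François
```

The round trip at the end only works because ISO-8859-1 maps each byte to a character without loss; it is a diagnostic trick, not a fix for the server configuration.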

Thanks to the excellent Web Developer toolbar for Firefox, I found out that certain pages had a charset imposed on them via a Content-Type HTTP header (see the headers under Tools > Web Developer > Information > View Response Headers, very handy). After more digging, I found that the pages that behaved properly already provided their own Content-Type header, which turned my suspicion to the brand new Apache 2 installation on my server.
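The same check can be scripted when a browser toolbar is not at hand. The sketch below uses only the Python standard library; the helper names charset_from_content_type and fetch_content_type are made up for this example, and the second one needs network access to a server of your own.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value (lowercased), or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def fetch_content_type(url):
    """Return the raw Content-Type response header for url (needs network access)."""
    with urlopen(url) as response:
        return response.headers.get("Content-Type")


# A header with a charset forced by the server, and one without any charset.
print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # None
```

Calling fetch_content_type on a static HTML page and on a CGI script from the same server, then passing each result to charset_from_content_type, should show the same difference the toolbar revealed.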

Bingo! Apache 2 now ships with a default AddDefaultCharset directive that forces the charset to ISO-8859-1 when one is not provided in the headers by an external module (such as a script). Since HTTP headers take precedence over the META tag in the HTML code, this basically voids your efforts to provide this information within HTML pages.
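To make the precedence rule concrete, here is a deliberately simplified sketch of how a client might pick the charset: the HTTP header wins when it carries one, and the META tag is only a fallback. The function name effective_charset and the regular expression are assumptions made for this illustration; real browsers apply more rules (byte order marks, sniffing heuristics) than shown here.

```python
import re
from email.message import Message

# Rough pattern for a charset declared inside a <meta> tag (illustration only).
META_CHARSET = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def effective_charset(content_type_header, html):
    """Return the charset a client would use: HTTP header first, META tag second."""
    msg = Message()
    if content_type_header:
        msg["Content-Type"] = content_type_header
    header_charset = msg.get_content_charset()
    if header_charset:
        # The header carries a charset, so the META tag is never consulted.
        return header_charset
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None


page = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>'

# Header charset forced by the server: the META declaration is ignored.
print(effective_charset("text/html; charset=ISO-8859-1", page))  # iso-8859-1

# No charset in the header: the META declaration is used.
print(effective_charset("text/html", page))  # utf-8
```

The first call mirrors the situation described above, where the page says utf-8 but the server's header says otherwise.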

This has been flagged, with merit, as Apache bug 23421 (see also Apache bug 14513).

If you experience this odd behavior, find the AddDefaultCharset directive in your httpd.conf and change it to this:

AddDefaultCharset Off

This will prevent Apache 2 from overriding the charset that you provide through META tags. Apache 1.3.x ships without this directive, which means it's off by default. You should have Apache force the charset only in very specific cases, but that should never be the default behavior IMHO.