Monday, June 16, 2014

Using PHP’s mb_detect_encoding to clean up your data

Ever heard of ISO-8859-1? Yeah… that nightmare… With it, my name, Sébastien, ends up more often than not as… SÃ©bastien. The computer gurus came up one day with UTF-8, and all our problems should have been solved: one encoding to rule them all.

Sweet, let’s all switch to UTF-8! Oh wait… legacy projects… PHP’s internal encoding is still not UTF-8, and functions like strlen() still cannot properly process multibyte strings. Word has it that UTF-8 support should land in PHP 6, but in the meantime, we still have to do something.

I am currently working on a big project with a lot of spaghetti legacy code and many entry points to the database. Almost all the data is stored in one big table, but not encoded uniformly. When I started working on it, tables had fields with a combination of ascii_bin, utf8_general_ci, latin1_general_ci and latin1_swedish_ci… and we are in Canada! In all those fields, data was stored with absolutely no guarantee of its encoding. Data was retrieved and passed through a series of UTF-8 encode/decode calls and stuff like this:

if (strpos($string, 'é') !== false) {
   $string = utf8_decode($string);
}

I eventually managed to change every field, change all database connections and remove all traces of encode/decode. However, I still had the problem of data not being encoded properly in some rows and columns. You may or may not be familiar with mb_detect_encoding; it makes for a very simple trick:

Once you have detected the encoding, use iconv to convert it. This crunched through my 1GB database in no time and I was then sure that everything was in UTF-8.
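The detect-then-convert step described above can be sketched roughly as follows. This is only a minimal sketch under assumptions: the helper name toUtf8() is hypothetical, and the candidate list assumes the legacy data is either valid UTF-8 or ISO-8859-1. The mb_check_encoding guard is an addition of mine, since mb_detect_encoding may not always prefer the first candidate when several encodings are valid for the same bytes.

```php
<?php
// Hypothetical helper: returns $value as UTF-8, converting it only when needed.
// Assumption: legacy values are either valid UTF-8 or ISO-8859-1.
function toUtf8(string $value): string
{
    // Valid UTF-8 (including plain ASCII) is returned untouched.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }

    // Strict mode (third argument) rejects candidates for which the bytes
    // are invalid, so UTF-8 is ruled out here and ISO-8859-1 remains.
    $encoding = mb_detect_encoding($value, ['UTF-8', 'ISO-8859-1'], true);
    if ($encoding === false) {
        return $value; // nothing matched: leave the data alone
    }

    // Convert from the detected encoding to UTF-8.
    $converted = iconv($encoding, 'UTF-8', $value);

    // iconv returns false on failure; fall back to the original bytes.
    return $converted === false ? $value : $converted;
}
```

Keep in mind that ISO-8859-1 accepts any byte sequence, so it acts as a catch-all here. If your data may contain other legacy encodings, the candidate list needs to be adapted, and detection remains a best guess.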

Of course, this is an example with a database, but the same trick works with any data. The script could also be faster if all the updates for a row were done at the same time.
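One way to group the updates per row, sketched under assumptions: the helpers changedColumns() and buildRowUpdate() are hypothetical names, the :pk_value placeholder is my own choice, and the backtick quoting assumes MySQL. The idea is to collect every column whose value changes after conversion, then issue a single parameterised UPDATE for that row instead of one query per column.

```php
<?php
// Hypothetical helper: keeps only the columns whose value changes
// once passed through $convert (for instance a to-UTF-8 converter).
function changedColumns(array $row, callable $convert): array
{
    $changes = [];
    foreach ($row as $column => $value) {
        if (!is_string($value)) {
            continue; // NULLs and numbers need no conversion
        }
        $fixed = $convert($value);
        if ($fixed !== $value) {
            $changes[$column] = $fixed;
        }
    }
    return $changes;
}

// Hypothetical helper: builds one UPDATE covering every changed column of a row.
// Assumption: $table, $idColumn and the column names come from trusted schema
// metadata, never from user input, since identifiers cannot be bound as parameters.
function buildRowUpdate(string $table, string $idColumn, array $changes): string
{
    $assignments = [];
    foreach (array_keys($changes) as $column) {
        $assignments[] = "`$column` = :$column";
    }
    return "UPDATE `$table` SET " . implode(', ', $assignments)
        . " WHERE `$idColumn` = :pk_value";
}
```

With PDO, the resulting string could be passed to prepare() and executed with the changed values plus the row's primary key bound to pk_value. Rows where changedColumns() returns an empty array can be skipped entirely.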

Please save yourself some trouble: make sure all user content is in UTF-8.