Database

Your Content Doesn't Look Right

gourmet fare

Database - Wrong Character Set

Here are two typical symptoms of a database with the wrong character set.

  • allan1 asked...

    After encountering some weirdness with a modx install that I'd upgraded from 2.2.9 to 2.2.10 I noticed that the database had two differnt character sets, Utf8-general-ci and Latin-Swedish. I decided to do a fresh install, with the result that I had a new all utf8 database, however when I install addons through package manager the character set of their tables is Latin-Swedish. Is this a problem and if so what can I do so that the packages aren't installed with a different character set to the original package?

  • joyobject asked...

    I have made 3 languages for my site which are: English as default, Danish and Thai. There is no problem on Danish language (I can use Scandinavian characters). However, when I switched to Thai. I typed in Thai from content field. What modx render on the web page became '????'.

The problem here is that the database was created with the wrong character set and/or collation. What does this mean? Perhap more importantly, how can it be fixed?

The Solution

In both cases, the solution is to convert the database to use the universal 8-bit characterset known as UTF-8. The best way to do this is with cdc, a script that BobRay of Bob's Guides has developed and kindly made publicly available. The script will not only convert the database and its tables to use UTF-8 by default, but will search through the database and specifically convert every text entry and index with the wrong collation to one of the UTF-8 collations. BobRay's page contains a full description and instructions for its use.

What's the Story?

To the computer, every character that we see is represented by a number. There are a multitude of sets of these numbers arbitrarily assigned to represent the characters of different languages, as well as non-alphanumeric characters such as © or —.

Computers think in numbersPlain-text English (ASCII) and most Western European languages (I rather imagine that this would have been very different if the computer industry had not begun in English-speaking countries) are represented by the first 128 characters of almost all of these sets, so any one of them will work for these languages. But if your language has other characters, or, like Thai, Arabic, Cyrillic and other non-Latin based languages it has a completely different alphabet, many character sets simply don't have representations for them, so you'll get garbage text or simply a ? for each character.

To solve this problem, the aforementioned multitude of character sets was devised. The database will support many, although certainly not all, of the possible character sets. If you wish to have content in Thai in a MySQL database, for example, you can use tis620. Newer versions of MySQL support more character sets.

But what if, like joyobject, you want to have multiple languages? For this, Unicode and its handful of UTF- character sets were developed years ago. In an attempt to dominate and lock in their markets, many of the bigger computer software and hardware companies resisted the adoption of such universal formats for years, forcing their customer base to use their own proprietary character sets and leading to the mish-mash of conflicting character sets we have to deal with.

Collation

Which comes first?Many languages have different characters for different forms of the same letter. For example, in German you can have u as in Muller, or ü as in Müller. Should these be sorted in a certain order, Muller before Müller? Should they be treated the same in a search? If you search for 'Mu%' will you only get Muller, or will you also get Müller? The collation will determine how such sorting and comparisons will be done. It's important to select the correct collation, as even UTF-8 has a number of language-specific collations. The MySQL documentation has a full explanation of its Unicode character sets and collations. You may need to change to the documentation for your version of MySQL to find out what character sets and collations it supports.

So why does MySQL install by default with the latin1 character set, and the latin1_swedish_ci collation? Because MySQL was originally developed and maintained in Sweden, at least until Oracle bought it a few years ago. It's easy enough to change that in the main MySQL engine's configuration file, but most hosting companies don't bother.