By Carsten | March 17, 2008
I’m still looking for new entries. I get quite a few suggestions, but not all of them make it into quiz questions. Do send in your suggestions!
Short questions with somewhat involved answers. In terms of MySQL (and databases in general)…
- What is a character set?
- What is a collation?
- How do they interact?
- Is this something you even need to worry about?
[answer]Answer: Let’s answer the last question first: Yes, this is definitely something you need to worry about. Even if your target audience today is in a single country and speaks a single language, the world around you is definitely moving toward a more flexible handling of characters and languages than you might realize. Who knows where the next data set you’ll be asked to import originated?
With that out of the way…
A character set is a collection of symbols, like “1″, “A”, “+”, “æ”, etc. It defines what those symbols look like, but does not tell us anything about ordering of the characters or how to interpret details like Upper- and lowercase instances (if any) of the same symbol. The ASCII character set defines symbols like A-Z, a-z, 0-9, and a number of other symbols. The Latin-1 character set (which MySQL uses as default) formalizes this a bit further; character sets like UTF-8 contains many more symbols.
A collation defines, within one character set, the ordering of the symbols as well as details defining upper- and lower case instances of the same letter (or, in the case of binary collations, defines that such distinctions should be ignored). Details such as whether “Ó” should be regarded as “O” or “aa” should be considered to “å”* during sorting is also handled by the collation.
Simply put, the combination of character set and collation is what makes up the alphabet of a particular language in a particular region.
When I was teaching for MySQL AB, I often brought along a couple of alphabet blocks like those you see in the picture to explain the difference. A character set roughly correspods to what you have when you tip the blocks from a bag unto the table: a bunch of symbols in no particular order. Once you start ordering the blocks into “A”, “B”, … you attain the sorting order, and hence your first step toward a collation.
One character set may be used to cover several different languages, with closely related characters in their alphabet. Latin-1, for example, covers most if not all of the languages native to the Western hemisphere. But each language within that area imposes different rules on sorting and other interpretations, and one language (German) even defines different sorting orders depending on the particular usage of the data.
Any particular character set has at least one, and often many associated collations; Any particular collation always belongs to one character set only. All of this can be quite daunting at first, but happily MySQL makes it quite easy to not only define default values for your databases, tables and columns; you can also tell MySQL to use different collations on individual queries depending on your exact needs.
*This is the case in Danish, where “aa” (sounds like the English “o”) has quite a different sound than double-a. This “traditional” form that was attempted replaced by the single letter “å” in a reform of the written language in 1948, introducing quite some headaches (not least later on when computers became commonplace) as the new letter was put at the end of the alphabet. Just to add to the trouble, there are instances where two as do appear side-by-side but are pronounced as two as and not å. Besides, this idea never really caught on with many of the more conservative thinkers, so no there’s a movement back to the old form of two as. Which doesn’t make life any easier on the CompSci people.