When MySQL encodings go bad
In this MySQL forum post, Erik runs into the confusing situation whereby he has to tell mysqldump to use the Latin-1 character set in order to get UTF-8 output. Bizarre magic? Well, no.
In any database-driven application, the encoding needs to be specified in several different places. If you’re a good developer, you’ll be using some Unicode encoding, quite possibly UTF-8. The problems arise when you send those UTF-8 bytes to your database.
It used to be the case (I don’t know if it still is) that the default encoding used by MySQL, unless you specified otherwise, would be Latin-1 (the collation rules applied would be those for Sweden, incidentally). Out of the box, with the defaults left alone, all of the text fields in all of your tables will be specified as containing Latin-1. In addition, your database connections will also use Latin-1. Amazingly, though, everything will seem to work.
The reason for this is twofold. First, Latin-1 is an 8-bit encoding in which all of the values (save for 0, the NUL byte) are valid characters. This means that any byte in any other encoding is also a valid Latin-1 character: but (in the case of UTF-8) they’re only going to actually refer to the same character within the range 0-127. For example, Unicode character U+00A3 is encoded as a single byte (0xA3) in Latin-1, but as two bytes (0xC2 0xA3) in UTF-8. If you told a piece of software that those two bytes were in fact two Latin-1 characters, it would have no reason to doubt you, and that’s exactly what happens when you store a pound (sterling) sign in a Latin-1 encoded field: MySQL is perfectly happy to accept the bytes 0xC2 0xA3 as Latin-1, because it’s entirely reasonable for those bytes to appear in a Latin-1 string.
The second reason why it works is that if your field is Latin-1 and your connection encoding is also Latin-1, MySQL need make no attempt at converting between character sets. It’s storing the bytes you give it, and it returns them to you as-is when you perform queries. MySQL hands you back what it interprets as two Latin-1 characters, and your application assumes it’s getting a UTF-8 string back again. Everything will appear to work.
The problems arise as soon as you do ask MySQL to perform conversions for you. Maybe you’re not expecting it to have to, but MySQL sees a mismatch somewhere and attempts to convert. In Erik’s situation, he asked MySQL for his data encoded as UTF-8. MySQL saw that the requested character set (UTF-8) didn’t match the field character set (Latin-1) and so set about transcoding all of his data from Latin-1 to UTF-8, producing what’s known in the business as “complete and utter junk”. What MySQL didn’t know, because nobody had told it, was that the data in those tables was really UTF-8 in the first place and no conversion was required. When Erik demanded mysqldump output Latin-1 instead, MySQL didn’t bother to perform any conversion because the character sets matched (even though both were incorrect) and so it spat out the original UTF-8 bytes, albeit while claiming it to be Latin-1.
It would be easy to pin the blame on MySQL for this: after all, if its defaults were to store UTF-8 by default, Erik wouldn’t have had a problem. I’m not sure I agree, though. First of all, people should read the documentation. Second, in the western world, Latin-1 is still the most widely used encoding, and used as a (sane, in a manner of speaking) default where encoding is left unspecified. If Erik wasn’t trying to do the right thing—in this day and age, amazingly still a rarity—everything would have talked Latin-1 uniformly, and he wouldn’t have had a problem until he tried to store “優雅” in his database.
So, the bottom line: if mysqldump produces junk, force it to output in the encoding MySQL thinks you’re using to store the data (rather than the encoding you really are using to store the data). After you’ve done that, and produced a properly-formatted data file, go forth and kick yourself for believing all of the “MySQL is easy” hype. It’s not. At least, it’s not if you want to do things properly, just like everything else.