MySQL encoding hell: How to export UTF-8 data from a latin1 table

by José Ribeiro

After upgrading a client’s Rails application and all its gems, it was finally time to export the MySQL database. To everyone’s surprise, characters beyond ASCII came out completely broken, even though the original version of the client’s application had been rendering diacritics, Chinese, and even Cyrillic perfectly. This post talks about the real problem lurking underneath the cushy MySQL cover, and how we solved it.

I advise you to avoid jumping directly to the solution and to first make perfectly sure that this is also your problem. A diagnosing method is described at the end of this document.

Introduction

Everyone who has struggled with character encodings, whether while exporting MySQL databases or simply while dealing with exotic strings in Python (the famous 'ascii' codec can't encode character), knows how hellish it can be: from a simple encoding mismatch to double-encoded data, the possibilities are endless. And the

á日�

gibberish is usually useless for pinpointing the source of the problem. And when it comes to MySQL charset and collation problems… it becomes even worse. Because MySQL

  • … has a default database charset and collation.
  • … has a per table charset and collation.
  • … has a per text field charset and collation.

Obviously, this degree of specification gives MySQL a great yet troublesome power. As more and more web applications are built with a global user base in mind, UTF-8 has achieved the status of de facto standard when it comes to choosing an encoding; however, for applications already out there, the solution usually implies migrating the database from a charset such as latin1, and, in particular, from its collation latin1_swedish_ci. Why? Because latin1_swedish_ci is the default collation for the latin1 character set, which in turn is the default character set for MySQL (as of MySQL 5.0).

Most databases are created using these defaults. So was our client’s database. But latin1 is a single-byte charset, so it didn’t make much sense that our client’s app was rendering an enormous range of alphabets correctly.

As a side note, I should clarify that MySQL’s latin1 is not [ISO-8859-1][] as one may think, but is in fact [Windows-1252][]. This choice is justified in the [MySQL Reference Manual West European Character Sets][] entry:

This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL’s latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

[ISO-8859-1]: http://en.wikipedia.org/wiki/ISO/IEC_8859-1 "ISO/IEC 8859-1 - Wikipedia, the free encyclopedia"
[Windows-1252]: http://en.wikipedia.org/wiki/Windows-1252 "Windows-1252 - Wikipedia, the free encyclopedia"
[MySQL Reference Manual West European Character Sets]: http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html "MySQL 5.0 Reference Manual: 10.1.13.2. West European Character Sets"
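If you want to see this cp1252 behaviour for yourself, here is a minimal check you can run in a mysql session (a sketch, assuming a UTF-8 terminal):

    -- Byte 0x80 interpreted under MySQL's latin1 (really cp1252) yields the
    -- Euro sign; under strict ISO-8859-1 that code point would be undefined.
    SELECT CONVERT(X'80' USING latin1);
    -- Expected output: €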

The problem

The real problem was that UTF-8 data had been inserted into latin1 fields. And not just a single row: a major part of the database was, in fact, UTF-8. But even when we came to this conclusion, some questions remained: how could UTF-8 data have gotten there in the first place? And how could UTF-8 data inside latin1 fields be rendered correctly (which made this encoding mismatch silent)?

The answer is simple: Rails did it. Well, not Rails per se. Imagine the following sequence of events:

  1. The database is created using MySQL default options: a database with a latin1 charset is born;

  2. A Rails application is developed on top of the already existing database, where international users post UTF-8 data;

  3. UTF-8 data is treated as a multibyte character sequence, but erroneously sent to MySQL as latin1 (due to a misconfiguration of the charset Rails used for the database connection);

  4. MySQL receives the “supposedly” latin1 data and handles it as a sequence of single-byte characters, which is stored as is (meaning it suffers no conversion), since the data supposedly uses the same charset as the database.

  5. This also explains why the user could see UTF-8 data: since it is also retrieved as latin1 by Rails, the supposedly single-byte character sequence is handed back byte for byte (no conversion is performed) and is correctly interpreted as the UTF-8 multibyte characters it represents, ending with a UTF-8 webpage.

We tried mysqldump with multiple flags, overriding charsets, but nothing really worked. Some attempts worked partially (showing diacritics but not alphabets other than the Latin one). The solutions about to be described are the combined result of several approaches, adapted to our specific needs.

The solution

Update: We were told through Twitter by @fhe that MySQL’s utf8 charset was breaking emojis and that he had to use utf8mb4. It turns out MySQL’s utf8 is NOT UTF-8: while valid UTF-8 multi-byte sequences may use up to four 8-bit bytes, MySQL’s utf8 charset supports a maximum of three bytes per sequence.
Please replace every occurrence of utf8 in the commands below with utf8mb4 for full UTF-8 support.
He also pointed out that you may need to use tables with a dynamic row format and large index prefixes, since utf8mb4 uses more space than plain utf8.
Mathias Bynens wrote a really nice article on how to fully support UTF-8 on MySQL.
Thanks @fhe!
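For reference, here is a minimal sketch of the InnoDB settings that remark refers to on MySQL 5.6-era servers (these variables were removed in MySQL 8.0, where large index prefixes are the default; treat the exact settings as an assumption to verify against your version):

    -- utf8mb4 reserves 4 bytes per character when sizing index keys, so
    -- indexed columns can exceed the old 767-byte prefix limit.
    -- Dynamic rows also require innodb_file_per_table=1.
    SET GLOBAL innodb_file_format = 'Barracuda';
    SET GLOBAL innodb_large_prefix = ON;
    -- New tables should then be created with ROW_FORMAT=DYNAMIC.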

If you are 100% sure this is your problem, you may be wondering how to fix it (and I’m sure you know by now that simply doing a mysqldump won’t do the trick). Let’s consider your options:

  1. You want to do it during the dumping process, since you can’t afford to do it in production (or you’re simply migrating and it’s just good timing);

  2. You want to do it on the database itself (be sure to have a full backup just in case). Since we couldn’t afford to do it in production, we chose to do it while exporting the database.

Fixing data on the database itself (inside of the damaged database)

We chose not to perform this kind of solution on our client’s database, but most of the solutions found online point in this direction. The WordPress Codex has one of the best-documented solutions I’ve seen for this.

The reasoning behind it is that the data is actually in a different charset than MySQL thinks it is, and changing the charset directly would lead MySQL into an erroneous attempt to convert the data into the new charset. By first changing the field type to a binary type (which has no charset), we avoid this attempt, so our data is left intact to be correctly interpreted according to its new charset, as sketched below.
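A minimal sketch of that two-step conversion for a single column, assuming a hypothetical latin1 table `posts` with a TEXT column `body` that actually holds UTF-8 bytes (take a full backup before trying this):

    -- Step 1: drop the charset without touching the bytes.
    ALTER TABLE `posts` MODIFY `body` BLOB;
    -- Step 2: reattach a charset; the bytes are reinterpreted, not converted.
    ALTER TABLE `posts` MODIFY `body` TEXT CHARACTER SET utf8;

As per the update above, use utf8mb4 instead of utf8 for full UTF-8 support.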

It is worth mentioning that this solution needs to be applied to every latin1 field, which in our case would be impractical given the number of fields and tables.

Fixing data outside of Production (outside of the damaged database)

The reasoning behind this solution is related to why the import of a simple mysqldump fails:

  • The dump includes both the schema and data; the schema will create the tables as latin1, while data will be UTF-8! We’ll end up where we started!
  • MySQL will try to convert your data if it thinks it should! Several solutions pointed to the same command sequence:
    1. mysqldump with the --skip-set-charset --default-character-set=latin1 flags, to prevent MySQL from attempting a conversion or setting a charset in the dump.
    2. Replacing every occurrence of CHARSET=latin1 with CHARSET=utf8 (whether by using sed, perl or any text editor that correctly supports UTF-8; some editors may break the dump). This ensures the tables are created with the correct charset.
    3. Restoring the dump.

With this rationale in mind, a combination of these solutions was achieved, with a small adjustment. Since the replacing step must be done on a file or through a pipeline, I believe the best way to do it is by separating the schema dump from the data itself. That way, the replace can be performed on the schema dump only, while the data is piped directly to its destination without running sed over it (the data being typically far larger than the schema). This led to the final sequence:

  1. mysqldump of the database schema, either to a file or pipelined with the next 2 steps.
  2. Replace every occurrence of CHARSET=latin1 by CHARSET=utf8 on the schema dump.
  3. Restoring the schema dump on a newly created database (with UTF-8 as default!).
  4. mysqldump of the data with the --skip-set-charset --default-character-set=latin1 flags, piped directly into a mysql session on the destination host. This translates into:
    # Log into your future database host to create a new database with a UTF-8 charset
    $ mysql -h FUTURE_HOST -u FUTURE_USER -p
    mysql> CREATE DATABASE `FUTURE_DB` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
    # Load the current database schema on the future host, replacing all CHARSET=latin1 occurrences along the way
    $ mysqldump -h CURRENT_HOST -u CURRENT_USER -p CURRENT_DB --no-data --skip-set-charset --default-character-set=latin1 \
    | sed 's/CHARSET=latin1/CHARSET=utf8/g' \
    | mysql -h FUTURE_HOST -u FUTURE_USER -p FUTURE_DB --default-character-set=utf8
    # Load the current database data on the future host
    $ mysqldump -h CURRENT_HOST -u CURRENT_USER -p --no-create-db --no-create-info --skip-set-charset --default-character-set=latin1 CURRENT_DB \
    | mysql -h FUTURE_HOST -u FUTURE_USER -p FUTURE_DB --default-character-set=utf8


I left out the usual performance flags such as --compress, --single-transaction, --quick or --opt, as well as --max_allowed_packet, but you should consider combining these with your dumps/imports.
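For instance, the data dump combined with some of these flags might look like the following (a sketch; adjust to your data size and storage engine):

    $ mysqldump -h CURRENT_HOST -u CURRENT_USER -p --single-transaction --quick \
    --no-create-db --no-create-info --skip-set-charset --default-character-set=latin1 CURRENT_DB \
    | mysql -h FUTURE_HOST -u FUTURE_USER -p FUTURE_DB --default-character-set=utf8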

Remarks

I recommend reading Blue Box’s post on this matter, Blue Box Blog: Getting out of MySQL Character Set Hell: they analyze the problem in fine detail and suggest several solutions, even addressing the problem of double and triple encoding. Socialcast’s Rails 2, MySQL, and Character Sets is also a nice read on the subject. This post should be seen as a mere attempt to provide the reader with a simple explanation, a diagnosing method and two solutions, according to their specific needs.

Side notes

Diagnosing the problem (a.k.a. “How to tell if this is happening to your database”)

Querying the database for data will usually produce garbled text; if that is the case, try running the following command in a mysql session before querying the database again.

    SET SESSION CHARACTER_SET_RESULTS = latin1;


If you’re now seeing your data correctly, this is a strong indicator that your latin1 database suffers from this disease: since the results are now being interpreted as latin1, no conversion is performed and the UTF-8 data is shown on your terminal as is.

Attention: I advise you to use the mysql client for querying the data and avoid using phpMyAdmin or any elaborate client for this, since these may override the client’s and results’ charsets, hiding the problem.
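Another check that does not depend on client or session charset settings is to inspect the raw bytes with HEX(). A minimal sketch, assuming a hypothetical latin1 table `posts` with a text column `body`:

    -- 'á' stored as true latin1 is the single byte E1; stored as UTF-8 bytes
    -- inside a latin1 column it becomes the two bytes C3 A1.
    SELECT body, HEX(body) FROM posts LIMIT 5;
    -- Seeing C3A1 where you expect E1 betrays UTF-8 data living in a latin1 field.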

Reproducing the problem

Run these commands on a mysql session.

CREATE DATABASE `latin1_test_db` DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;
USE `latin1_test_db`;
CREATE TABLE IF NOT EXISTS `latin1_test_tbl` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `field` text NOT NULL,
  PRIMARY KEY (`id`)
) DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;
SHOW VARIABLES LIKE 'collation_database';
SHOW VARIABLES LIKE 'character_set%';
/* The commands below will allow your mysql client to act as a misconfigured Rails application */
SET SESSION CHARACTER_SET_CLIENT = latin1;
SET SESSION CHARACTER_SET_RESULTS = latin1;
INSERT INTO `latin1_test_tbl` (`id`, `field`)
VALUES (NULL, 'Coimbra diz Olá! 日本語 is Japanese in Japanese, obviously...');
SELECT * FROM `latin1_test_tbl`;

Your data was correctly inserted, as the SELECT command shows. Now end your MySQL client session, log in again and query the table for its data (using the last command from above), but this time without setting the session’s character set. Is that gibberish familiar?
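For reference, given the cp1252 mapping described earlier, the inserted string should come back looking something like this on a UTF-8 terminal (an expected-output sketch, not captured from a real session):

    Coimbra diz OlÃ¡! æ—¥æœ¬èªž is Japanese in Japanese, obviously...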

