
UTF-8 Data on Latin1 Tables: Converting to UTF-8 Without Downtime


This post originally appeared on the MySQL Performance Blog.

Here’s a problem many of us have encountered: you have a latin1 table defined like the one below, and your application is storing utf8 data to the column over a latin1 connection. Obviously, double encoding occurs. Now your development team has decided to use utf8 everywhere, but the conversion must keep the stored data valid while incurring little to no downtime.

CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
master> SET NAMES latin1;
master> INSERT INTO t (c) VALUES ('¡Celebración!');
master> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)
master> SET NAMES utf8;
master> SELECT id, c, HEX(c) FROM t;
+----+---------------------+--------------------------------+
| id | c                   | HEX(c)                         |
+----+---------------------+--------------------------------+
|  3 | Â¡CelebraciÃ³n!    | C2A143656C656272616369C3B36E21 |
+----+---------------------+--------------------------------+
1 row in set (0.00 sec)
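What is actually on disk is the UTF-8 encoding of the string, sitting verbatim in a latin1 column. A quick standalone Python check (not part of the original post) reproduces the hex dump and shows why a utf8 connection displays mojibake:

```python
# The string the application meant to store.
s = '¡Celebración!'

# UTF-8 bytes sent over the latin1 connection; MySQL stores them verbatim
# because no latin1 -> latin1 conversion takes place.
b = s.encode('utf-8')
print(b.hex().upper())      # C2A143656C656272616369C3B36E21, matches HEX(c)

# With SET NAMES utf8, the server transcodes the "latin1" column for the
# client, treating each stored byte as a latin1 character: classic mojibake.
print(b.decode('latin-1'))  # Â¡CelebraciÃ³n!
```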

One approach, as described in the manual, is to convert the TEXT column to BLOB, then convert the table character set to utf8 and change column c back to TEXT, like this:

master> ALTER TABLE t CHANGE c c BLOB;
master> ALTER TABLE t CONVERT TO CHARACTER SET utf8, CHANGE c c TEXT;
master> SET NAMES utf8;
master> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)
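The BLOB step is what makes this safe: it strips the character set from the column so the following CONVERT cannot transcode the bytes. Converting the column straight from latin1 to utf8 would re-encode the already-UTF-8 bytes a second time. A small Python sketch (an illustration, not the server's code) of that failure mode:

```python
s = '¡Celebración!'
stored = s.encode('utf-8')   # bytes sitting in the latin1 column

# A direct latin1 -> utf8 column conversion re-encodes each stored byte
# as if it were a latin1 character: the data is now double-encoded.
doubled = stored.decode('latin-1').encode('utf-8')
print(len(stored), len(doubled))   # 15 19, every high byte grew
print(doubled.decode('utf-8'))     # Â¡CelebraciÃ³n!

# Going through BLOB leaves the bytes untouched, so the final utf8 TEXT
# column holds exactly the original UTF-8 sequence.
assert stored.decode('utf-8') == s
```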

All good so far. But if the tables are big enough that these ALTERs would disrupt your application significantly, downtime becomes a problem, and the old trick of using slaves comes into play. In a nutshell: convert the TEXT column to BLOB on a slave first, then switch your application over to use that slave as the new master. Any utf8 data written via replication or by the application will be stored and retrieved without issues, whether the connection character set is latin1 or anything else, because BINARY data types have no character set at all. Let me show you:

slave> SET NAMES latin1;
slave> INSERT INTO t (c) VALUES ('¡Celebración!');
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)
slave> ALTER TABLE t CHANGE c c BLOB;
slave> SET NAMES latin1;
slave> INSERT INTO t (c) VALUES ('¡Celebración!');
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  4 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
2 rows in set (0.00 sec)
slave> SET NAMES utf8;
slave> INSERT INTO t (c) VALUES ('¡Celebración!');
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  4 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  5 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
3 rows in set (0.00 sec)

As you can see, while the column is still a BLOB, I have no problem reading or storing utf8 data in it. Once your application has been configured to use this slave over a utf8 connection, you can convert the column and the table back to TEXT and the utf8 character set.

slave> ALTER TABLE t CONVERT TO CHARACTER SET utf8, CHANGE c c TEXT;
slave> SET NAMES utf8;
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  4 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  5 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
3 rows in set (0.00 sec)

A few caveats, though. First, you cannot replicate from the BLOB (or utf8) column back to the latin1 column on the original master; doing so would just result in double encoding, so you will have to discard the original master's copy of the data. Second, while the column is a BLOB (or any other BINARY type) and is indexed, queries that use the index may return results in a different order. This is because binary data is compared byte by byte on numeric value, not as character strings under a collation. Here is an example:

master> SHOW CREATE TABLE t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` blob,
  PRIMARY KEY (`id`),
  KEY `c` (`c`(255))
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
master> SET NAMES latin1;
master> INSERT INTO t (c) VALUES ('¡Celebración!'), ('Férrêts being fërøcîóúß'), ('Voyage à Montreal');
master> SELECT c FROM t ORDER BY c;
+---------------------------------+
| c                               |
+---------------------------------+
| ¡Celebración!                   |
| Férrêts being fërøcîóúß         |
| Voyage à Montreal               |
+---------------------------------+
3 rows in set (0.00 sec)
master> ALTER TABLE t CHANGE c c BLOB;
master> SELECT c FROM t ORDER BY c;
+---------------------------------+
| c                               |
+---------------------------------+
| Férrêts being fërøcîóúß         |
| Voyage à Montreal               |
| ¡Celebración!                   |
+---------------------------------+
3 rows in set (0.00 sec)

See how the results are now ordered differently?
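The reordering is easy to reproduce outside MySQL. A BLOB index compares raw byte values, so the leading bytes 0xC2 0xA1 of '¡' sort after every ASCII letter; as latin1 TEXT, the leading byte 0xC2 is the letter 'Â', which latin1_swedish_ci collates with 'A', which is why that row sorted first. A sketch in Python with the same three strings:

```python
words = ['¡Celebración!', 'Férrêts being fërøcîóúß', 'Voyage à Montreal']

# BLOB comparison: raw UTF-8 bytes, ordered by numeric byte value.
by_bytes = sorted(w.encode('utf-8') for w in words)
for b in by_bytes:
    print(b.decode('utf-8'))
# Férrêts being fërøcîóúß   ('F' = 0x46)
# Voyage à Montreal         ('V' = 0x56)
# ¡Celebración!             ('¡' = 0xC2 0xA1, sorts last)
```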

What’s your utf8 horror story? Share it with us in the comments below :-)


Published at DZone with permission of Peter Zaitsev, DZone MVB. See the original article here.

