DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Harnessing the Power of AWS Aurora for Scalable and Reliable Databases
  • Tobiko Data: Revolutionizing Data Transformation With SQLMesh
  • Automating Operational Efficiency: Integrating AI Insights From Amazon SageMaker Into Business Workflows
  • Enhancing Performance: Optimizing Complex MySQL Queries for Large Datasets

Trending

  • Performance Optimization Techniques for Snowflake on AWS
  • Memory Leak Due to Time-Taking finalize() Method
  • Streamlining Event Data in Event-Driven Ansible
  • Contextual AI Integration for Agile Product Teams
  1. DZone
  2. Data Engineering
  3. Data
  4. How MySQL 5.7 Handles 'utf8mb4' and the Load Data Infile

How MySQL 5.7 Handles 'utf8mb4' and the Load Data Infile

Alexander Rubin shows how MySQL 5.7 handles UTFMB4 and the load data infile through text and code snippets, and how to insert emojis.

By 
Alexander Rubin user avatar
Alexander Rubin
·
Jul. 07, 16 · Tutorial
Likes (1)
Comment
Save
Tweet
Share
24.3K Views

Join the DZone community and get the full member experience.

Join For Free

In this post, I’ll discuss how MySQL 5.7 handles UTF8MB4 and the load data infile.

Many of my clients have told me that they do not like using the LOAD DATA INFILE statement and prefer to manually parse and load the data. The main reason they do it is issues with the character sets, specifically UTF8MB4 and the load data infile. This was surprising to me as nowadays everyone uses UTF8. MySQL 5.7 (as well as 5.6) has full support for UTF8MB4, which should fix any remaining issues.

Last week, I was investigating an interesting case where we were loading data and got the following error:

mysql -e 'select version()'
+-----------+
| version() |
+-----------+
| 5.7.12    |
+-----------+
$ mysql -vvv testdb < load_data.sql
ERROR 1300 (HY000) at line 1: Invalid utf8mb4 character string: 'Casa N'

The load data statement:

LOAD DATA LOCAL INFILE
                           'input.psv'
                        REPLACE INTO TABLE
                            input
                        CHARACTER SET
                            utf8mb4
                        FIELDS
                            TERMINATED BY '|'
                        LINES
                            TERMINATED BY 'rn'
                        IGNORE
                            1 LINES

The table uses the correct character set (global character set applied to all varchar fields):

CREATE TABLE `input` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  ...
  `address` varchar(255) DEFAULT NULL,
  ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

The string looked like “Casa Nº 24”. So this should be N + U+00BA (MASCULINE ORDINAL INDICATOR, hex code: c2ba). When I do “less input.tsv”, it shows N<BA> 24. So why can’t MySQL load it?

After further investigation, we discovered the original encoding is not UTF8. We found out by running:

$ file -i input.tsv
input.tsv: text/plain; charset=iso-8859-1

So the code <BA> was misleading. Also, when I got the actual character from the file, it was just one byte (UTF8 for this character should be two bytes). When MySQL parsed the UTF8 input file, it found only the first part of the multibyte UTF8 code and stopped with an error.

The original character in hex is “ba”:

xxd -p char_ascii
ba0a

(0a is a carriage return, and “ba” is “masculine ordinal indicator”)

The UTF8 equivalent:

$ xxd -p char_utf8
c2ba0a

This is now two bytes (+ carriage return): c2ba

To solve the problem we can simply change the CHARACTER SET utf8mb4 to CHARACTER SET latin1 when doing a load data infile. This fixed the issue:

Query OK, 2 rows affected (0.00 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0
mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
mysql> select address from input;
+--------------------------------+
| consignee_address              |
+--------------------------------+
| Casa Nº 24 ................... |
...
+--------------------------------+
2 rows in set (0.00 sec)

Another option will be to detect the character set encoding (iconv can do it) and convert to UTF8.

But it Worked Before…?

It worked a bit differently in MySQL 5.6:

$ mysql -e 'select version()'
+-------------+
| version()   |
+-------------+
| 5.6.25-73.0 |
+-------------+
$ mysql -vvv testdb < load_data.sql
...
Query OK, 2 rows affected, 2 warnings (0.00 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 2
--------------
show warnings
--------------
+---------+------+---------------------------------------------------------------------+
| Level   | Code | Message                                                             |
+---------+------+---------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: 'xBA 24 ...' for column 'address' at row 1  |
| Warning | 1366 | Incorrect string value: 'xBA 24 ...' for column 'address' at row 2  |
+---------+------+---------------------------------------------------------------------+
2 rows in set (0.00 sec)

MySQL 5.7 is more strict and doesn’t allow you to insert data in the wrong format. However, it is not 100 percent consistent. For some characters, MySQL 5.7 will also throw a warning if disabling strict SQL mode.

Another character that caused the same issue was xC9. When loading to MySQL 5.7 with the default sql_mode (ONLY_FULL_GROUP_BY, STRICT_TRANS_TABLES, NO_ZERO_IN_DATE, NO_ZERO_DATE, ERROR_FOR_DIVISION_BY_ZERO, NO_AUTO_CREATE_USER, NO_ENGINE_SUBSTITUTION) it throws an error:

ERROR 1366 (HY000) at line 1: Incorrect string value: 'xC9' for column 'address' at row 1

When disabling the strict mode it now defaults to warnings:

mysql> set global sql_mode = '';
Query OK, 0 rows affected (0.00 sec)
Query OK, 2 rows affected, 1 warning (0.00 sec)
Records: 1  Deleted: 1  Skipped: 0  Warnings: 1
--------------
show warnings
--------------
+---------+------+--------------------------------------------------------------+
| Level   | Code | Message                                                      |
+---------+------+--------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: 'xC9' for column 'address' at row 1  |
+---------+------+--------------------------------------------------------------+
1 row in set (0.00 sec)

Emoji in MySQL

With UTF8MB4 support (in MySQL 5.6 and 5.7), you can also insert a little dolphin into a MySQL table:

CREATE TABLE `test_utf8mb4` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `v` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into test_utf8mb4 (v) values ('Dolphin:��');
Query OK, 1 row affected (0.00 sec)
mysql> select * from test_utf8mb4;
+----+--------------+
| id | v            |
+----+--------------+
|  1 | Dolphin:��   |
+----+--------------+
1 row in set (0.00 sec)

This should help you clear up issues with UTF8MB4 and the load data infile. Have fun!

MySQL Data (computing)

Published at DZone with permission of Alexander Rubin, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Harnessing the Power of AWS Aurora for Scalable and Reliable Databases
  • Tobiko Data: Revolutionizing Data Transformation With SQLMesh
  • Automating Operational Efficiency: Integrating AI Insights From Amazon SageMaker Into Business Workflows
  • Enhancing Performance: Optimizing Complex MySQL Queries for Large Datasets

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!