DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations

Bulk Insert Issue With UTF-8 Fixed-Width File Format

A SQL expert gives a tutorial on working with UTF-8 and Unicode in your database. If you've ever run into this issue, then you're going to want to check out this post.

Randolph West user avatar by
Randolph West
·
Sep. 16, 18 · Presentation
Like (3)
Save
Tweet
Share
7.94K Views

Join the DZone community and get the full member experience.

Join For Free

Fellow Canadian Doran Douglas brought this issue to my attention recently, and I wanted to share it with you as well.

Let's say you have a file in UTF-8 format. What this means is that some of the characters will be single-byte, and some may be more than that.

Where this becomes problematic is that a fixed-width file has fields that are, well, fixed in size. If a Unicode character requires more than one byte, it's going to cry havoc and let slip the dogs of truncation.

Take a look at this sample:

Curaçao                 531CUW
Canada                  124CAN
Åland Islands           248ALA
United States of America840USA

Notice that the first field (the Country name) is a fixed 24 characters, and the two code fields are three characters in length each.

Also note that the first and third lines have Unicode characters: the ç and the Å specifically.

What happens if you use BULK INSERT or the bcp command line tool?

The fixed-width parser has been told that the Country name is 24 characters wide, so it brings in the eight bytes of Curaçao followed by 16 bytes of the empty string (the ç takes up two bytes because it is Unicode).

Then the next field of three characters is brought in, comprising an empty space, and 53. Finally, the third field is brought in as 1CU. The parser knows there is a line break, so it imports the W into its own field as Country name and nothing in the next two fields. Following another line break, Canada is imported successfully, along with its two code fields.

In the case of Åland Islands, the problem repeats itself and only portions of the next two fields are imported correctly, with the final A ending up on its own line. The final line is imported successfully again.

Curaçao                  |  53 | 1CU
W                        |     |
Canada                   | 124 | CAN
Åland Islands            |  24 | 8AL
A                        |     |
United States of America | 840 | USA

Depending on the rules for import, this is the best case scenario if the source file is fixed-width and cannot be modified. Other variations result in much worse imports.

So How Do We Fix This?

The first and best option is to have the source file correctly formatted. That is, if there are Unicode characters in only some of the lines, the file generator should compensate by adding or removing trailing spaces accordingly, if it can.

In Doran's case, this was not possible because he had no control over the source file. Instead, he and I independently came up with the solution to import the entire file into a single column in SQL Server using the bulk loader, and then parse the fields based on length using Transact-SQL.

Here are three different solutions that I came up with. Because I tried three different ways, I used a staging table:

USE tempdb;

DROP TABLE IF EXISTS CountryStaging;

-- Single column for all three imported fields
CREATE TABLE CountryStaging (Line NVARCHAR(512));

-- Into staging table
BULK INSERT CountryStaging
FROM 'C:\Temp\CountryTestDataFixedWidth.txt'
WITH (ROWTERMINATOR = '\n');
GO

-- First method: SELECT with SUBSTRING()
SELECT
    SUBSTRING([Line], 0, 25) AS CountryName,
    SUBSTRING([Line], 25, 3) AS CountryCodeNumeric,
    SUBSTRING([Line], 28, 3) AS CountryCodeAlpha
FROM
    dbo.CountryStaging;

-- Second method: SELECT with LEFT and RIGHT
SELECT
    CAST(LEFT([Line], 24) AS NVARCHAR(255)) AS CountryName,
    CAST(LEFT(RIGHT([Line], (6)), (3)) AS CHAR(3)) AS CountryCodeNumeric,
    CAST(RIGHT([Line], (3)) AS CHAR(3)) AS CountryCodeAlpha
FROM
    dbo.CountryStaging;

-- Third method: Add in persisted computed columns
ALTER TABLE dbo.CountryStaging ADD CountryName AS (LEFT([Line], 24)) PERSISTED;
ALTER TABLE dbo.CountryStaging ADD CountryCodeNumeric AS (LEFT(RIGHT([Line], (6)), (3))) PERSISTED;
ALTER TABLE dbo.CountryStaging ADD CountryCodeAlpha AS (RIGHT([Line], (3))) PERSISTED;
GO

-- SELECT with computed columns
SELECT
    CountryName,
    CountryCodeNumeric,
    CountryCodeAlpha
FROM
    CountryStaging;

DROP TABLE IF EXISTS CountryStaging;

Doran's own solution was similar to my first solution using SUBSTRING(), but without the staging table.

Have you been bitten by Unicode/UTF-8? Tell me your horror stories and successes here in the comments.

UTF-8

Published at DZone with permission of Randolph West, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Comparing Map.of() and New HashMap() in Java
  • Spring Cloud
  • Apache Kafka Is NOT Real Real-Time Data Streaming!
  • 10 Easy Steps To Start Using Git and GitHub

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: