DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

How are you handling the data revolution? We want your take on what's real, what's hype, and what's next in the world of data engineering.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

SBOMs are essential to circumventing software supply chain attacks, and they provide visibility into various software components.

Related

  • Leetcode: Improving String Performance in Swift
  • RION - A Fast, Compact, Versatile Data Format

Trending

  • Testing Java Applications With WireMock and Spring Boot
  • Microservice Madness: Debunking Myths and Exposing Pitfalls
  • The Rise of the Intelligent AI Agent: Revolutionizing Database Management With Agentic DBA
  • Scrum Smarter, Not Louder: AI Prompts Every Developer Should Steal

Bulk Insert Issue With UTF-8 Fixed-Width File Format

A SQL expert gives a tutorial on working with UTF-8 and Unicode in your database. If you've ever run into this issue, then you're going to want to check out this post.

By 
Randolph West user avatar
Randolph West
·
Sep. 16, 18 · Presentation
Likes (3)
Comment
Save
Tweet
Share
8.7K Views

Join the DZone community and get the full member experience.

Join For Free

Fellow Canadian Doran Douglas brought this issue to my attention recently, and I wanted to share it with you as well.

Let's say you have a file in UTF-8 format. What this means is that some of the characters will be single-byte, and some may be more than that.

Where this becomes problematic is that a fixed-width file has fields that are, well, fixed in size. If a Unicode character requires more than one byte, it's going to cry havoc and let slip the dogs of truncation.

Take a look at this sample:

Curaçao                 531CUW
Canada                  124CAN
Åland Islands           248ALA
United States of America840USA

Notice that the first field (the Country name) is a fixed 24 characters, and the two code fields are three characters in length each.

Also note that the first and third lines have Unicode characters: the ç and the Å specifically.

What happens if you use BULK INSERT or the bcp command line tool?

The fixed-width parser has been told that the Country name is 24 characters wide, so it brings in the eight bytes of Curaçao followed by 16 bytes of the empty string (the ç takes up two bytes because it is Unicode).

Then the next field of three characters is brought in, comprising an empty space, and 53. Finally, the third field is brought in as 1CU. The parser knows there is a line break, so it imports the W into its own field as Country name and nothing in the next two fields. Following another line break, Canada is imported successfully, along with its two code fields.

In the case of Åland Islands, the problem repeats itself and only portions of the next two fields are imported correctly, with the final A ending up on its own line. The final line is imported successfully again.

Curaçao                  |  53 | 1CU
W                        |     |
Canada                   | 124 | CAN
Åland Islands            |  24 | 8AL
A                        |     |
United States of America | 840 | USA

Depending on the rules for import, this is the best case scenario if the source file is fixed-width and cannot be modified. Other variations result in much worse imports.

So How Do We Fix This?

The first and best option is to have the source file correctly formatted. That is, if there are Unicode characters in only some of the lines, the file generator should compensate by adding or removing trailing spaces accordingly, if it can.

In Doran's case, this was not possible because he had no control over the source file. Instead, he and I independently came up with the solution to import the entire file into a single column in SQL Server using the bulk loader, and then parse the fields based on length using Transact-SQL.

Here are three different solutions that I came up with. Because I tried three different ways, I used a staging table:

USE tempdb;

DROP TABLE IF EXISTS CountryStaging;

-- Single column for all three imported fields
CREATE TABLE CountryStaging (Line NVARCHAR(512));

-- Into staging table
BULK INSERT CountryStaging
FROM 'C:\Temp\CountryTestDataFixedWidth.txt'
WITH (ROWTERMINATOR = '\n');
GO

-- First method: SELECT with SUBSTRING()
SELECT
    SUBSTRING([Line], 0, 25) AS CountryName,
    SUBSTRING([Line], 25, 3) AS CountryCodeNumeric,
    SUBSTRING([Line], 28, 3) AS CountryCodeAlpha
FROM
    dbo.CountryStaging;

-- Second method: SELECT with LEFT and RIGHT
SELECT
    CAST(LEFT([Line], 24) AS NVARCHAR(255)) AS CountryName,
    CAST(LEFT(RIGHT([Line], (6)), (3)) AS CHAR(3)) AS CountryCodeNumeric,
    CAST(RIGHT([Line], (3)) AS CHAR(3)) AS CountryCodeAlpha
FROM
    dbo.CountryStaging;

-- Third method: Add in persisted computed columns
ALTER TABLE dbo.CountryStaging ADD CountryName AS (LEFT([Line], 24)) PERSISTED;
ALTER TABLE dbo.CountryStaging ADD CountryCodeNumeric AS (LEFT(RIGHT([Line], (6)), (3))) PERSISTED;
ALTER TABLE dbo.CountryStaging ADD CountryCodeAlpha AS (RIGHT([Line], (3))) PERSISTED;
GO

-- SELECT with computed columns
SELECT
    CountryName,
    CountryCodeNumeric,
    CountryCodeAlpha
FROM
    CountryStaging;

DROP TABLE IF EXISTS CountryStaging;

Doran's own solution was similar to my first solution using SUBSTRING(), but without the staging table.

Have you been bitten by Unicode/UTF-8? Tell me your horror stories and successes here in the comments.

UTF-8

Published at DZone with permission of Randolph West, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Leetcode: Improving String Performance in Swift
  • RION - A Fast, Compact, Versatile Data Format

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • [email protected]

Let's be friends: