Loading Data Into Azure SQL Data Warehouse


By Grant Fritchey · Feb. 18, 16 · Opinion


Let’s start with the level set. I’m not an ETL expert. In fact, I haven’t done any professional ETL work for several years. My skills are, at best, rusty. With this in mind, I knew I’d have a hard time extracting data from a local database in order to move it up to Azure SQL Data Warehouse. I expected to be fumbling and slow and to constantly feel more than a little stupid as I made one mistake after another. All of this came to pass.

And, oh my god, was it difficult!

Here’s how I started. I defined a bcp command for the tables I was interested in. I ensured it was working correctly, then wrote a little PowerShell script so I could supply a list of tables and get them all exported at once. Here’s an example:

# For each schema-qualified table name, strip the schema prefix to build the file name,
# then export the table with bcp in character mode (-c) over a trusted connection (-T).
$tables | ForEach-Object {
    $filename = $_.Substring($_.IndexOf(".") + 1, ($_.Length - ($_.IndexOf(".") + 1)))
    bcp $_ out "c:\data\bcp\$filename.rtf" -S'WIN-3SRG45GBF97\dojo' -d'ODS' -T -c
}
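
As an aside, the post never shows how $tables gets populated; it could be as simple as a hard-coded array of schema-qualified names (these particular names are hypothetical):

# Hypothetical list of schema-qualified tables to export; substitute your own.
$tables = @('dbo.Account', 'dbo.Contact', 'dbo.Invoice')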

That worked fine. Next up, I used AzCopy to load the files I created up to blob storage. Again, a little PowerShell work to move all the files at once:

# Upload each exported file to blob storage. $_ is used both as the destination folder
# and as the file-name pattern, so it needs to match the names of the exported files.
$tables | ForEach-Object {
    .\AzCopy.exe /Source:c:\data\bcp /Dest:"https://xxx.blob.core.windows.net/sales/data/$_" /DestKey:xx /Pattern:"$_.rtf" /Z:"C:\data\bcp" /Y
}
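
If you want to double-check what actually landed in the container without opening the portal, the Azure PowerShell storage cmdlets of the time can list the blobs. A quick sketch, using the same placeholder account name and key as above:

# Sketch: list the uploaded blobs (the account name and key are placeholders).
$ctx = New-AzureStorageContext -StorageAccountName 'xxx' -StorageAccountKey 'xx'
Get-AzureStorageBlob -Container 'sales' -Context $ctx | Select-Object Name, Length, LastModified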

No issues. Well, OK, I screwed up the bcp and AzCopy commands like twenty times until I got them right and then did the quick automation. However, the screw-ups were mine, not external issues. Next up, setting up external tables. There’s a good article on how to do this from Microsoft. Up until I tried to create the table, everything worked just fine. Then I saw this:

Msg 104051, Level 16, State 1, Line 11
Unsupported data type error. Statement references a data type that is unsupported in Parallel Data Warehouse, or there is an expression that yields an unsupported data type. Modify the statement and re-execute it.

Oh boy.

Luckily, there’s another blog post from Microsoft on getting the schema right. With NUMERIC out and DECIMAL in, I got my next error:

Msg 105001, Level 16, State 1, Line 11
CREATE EXTERNAL TABLE failed because the location URI could not be reached. Revise the URI and try again.

My fault. I tried to get fancy and did this:

create external data source AzureStoreageGF
with (type = Hadoop,
location = 'wasbs://sales/data@xxx.blob.core.windows.net',
credential = AzureStorageCredential);

Changing it to just sales@xxx.blob.core.windows.net and then making the WITH clause on my CREATE EXTERNAL TABLE command read like this (the full set of supporting objects is sketched below):

WITH (
LOCATION='data/Account/',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=TextFile);
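
For context, those fragments hang off a small chain of objects that the Microsoft walkthrough has you create first. Here is a minimal sketch of the whole chain, reusing the credential, data source, and file format names from above; the column list is a trimmed-down hypothetical, the pipe field terminator is inferred from the tokenization errors further on, and DECIMAL stands in for the unsupported NUMERIC:

-- One-time security and connection objects (names reused from the post; the key is a placeholder).
CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://sales@xxx.blob.core.windows.net',
      CREDENTIAL = AzureStorageCredential);

CREATE EXTERNAL FILE FORMAT TextFile
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));  -- pipe, which is what the later tokenization errors point to

-- The external table itself; this column list is a hypothetical stand-in for the real Account table.
CREATE EXTERNAL TABLE dbo.AccountExternal (
    AccountID   INT            NOT NULL,
    AccountName NVARCHAR(1300) NULL,
    Balance     DECIMAL(18,2)  NULL  -- DECIMAL rather than NUMERIC, per the schema post
)
WITH (
    LOCATION = 'data/Account/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFile);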

With the location fixed, it worked! I’m a winner. All I have to do now is query the PolyBase table I just created.

Msg 107090, Level 16, State 1, Line 132
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/data/Account/Account.rtf)Column ordinal: 1, Expected data type: NVARCHAR(1300) collate SQL_Latin1_General_CP1_CI_AS, Offending value: xxx (Tokenization failed), Error: Not enough columns in this line.

RASSIN FRASSIN… except I didn’t use those words; I substituted some other Anglo-Saxon terms. Anyway, what does it mean? It means, probably (although it’s hard to know), that I messed up the data export in some fashion, or that there’s something in the data that’s breaking the interpretation of the code... only I can’t find it. So, old school, I opened the file up in Excel and found two columns that had carriage returns in them. Fixed that. Going again:

Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 33 rows processed.
(/data/Account/Account.rtf)Column ordinal: 0, Expected data type: DECIMAL(18,2), Offending value: , Error: UTF8 decode failed.

A win! What? You didn’t see the win? You’re looking at the error message: 33 rows processed. Yeah, that’s out of 96,000 rows, but it’s more than 1, which means 32 rows processed successfully. Now, you might be wondering about that file format. The last time I got this error, it was because I had bcp output in a non-character format. That suggests there’s a data type in there that’s giving PolyBase more problems. Back to SQL Server to see what’s up with this data (because everything worked perfectly in Excel this time). And I found it. This is the problem:

‘å’

I’m using bcp, which writes out a UTF-16 file format. PolyBase supports the Hadoop standard of UTF-8. This must be what’s blocking it… Great. There it is in black and white:

bcp does not support exporting to UTF-8. This may prevent using PolyBase on bcp exported data

Fine. PowerShell to the rescue:

# Re-write every exported file in place as UTF-8, the encoding PolyBase expects and bcp can't produce directly.
Get-ChildItem -Recurse -Filter *.rtf | ForEach-Object { (Get-Content $_.FullName -Encoding String) | Set-Content -Encoding UTF8 $_.FullName }
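
Both of those finds (the stray carriage returns and the 'å') came from eyeballing the data. If you’d rather hunt for them on the source SQL Server, a query along these lines does the job; the table and column names here are hypothetical:

-- Sketch: run against the source ODS database to find rows whose text columns contain
-- embedded line breaks or anything outside printable ASCII (the two things that bit this load).
-- dbo.Account and AccountName are hypothetical stand-ins for the real table and column.
SELECT AccountID, AccountName
FROM dbo.Account
WHERE AccountName LIKE '%[^ -~]%' COLLATE Latin1_General_BIN;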

This time I processed 197 rows before I found another problem:

Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 197 rows processed.
(/data/Account/Account.rtf)Column ordinal: 110, Expected data type: INT NOT NULL, Offending value: |45178|1043866 (Tokenization failed), Error: Too many columns in the line.

What follows is a tedious and dull march through various & sundry text columns that, for some strange reason, have a pipe (|) character in them (why the phone number, by Odin’s Eye Patch, why the phone number). Suddenly… I have 96,000 rows returning from my COUNT(*) query (using Visual Studio; SSMS was twitchy when connected to ADW). Run a CREATE TABLE AS SELECT (CTAS) query and I have a new table. Ta-da.
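
For the record, that last step is just a CTAS over the external table. A minimal sketch, reusing the hypothetical dbo.AccountExternal table from the earlier sketch and assuming a round-robin distribution:

-- The COUNT(*) sanity check that finally returned all 96,000 rows:
SELECT COUNT(*) FROM dbo.AccountExternal;

-- Then land the data in a regular Azure SQL Data Warehouse table with CREATE TABLE AS SELECT:
CREATE TABLE dbo.Account
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM dbo.AccountExternal;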

Let’s bottom-line this. Data is dirty, and any ETL process is likely to expose the trashy data, so what are the real issues here?

  • PowerShell is my best-est buddy. Scripting everything through Posh right out of the gate made a huge difference in my ability to constantly reset.
  • The fact that our most common processes spit out UTF-16, while not a problem in itself, is something you need to get on top of (see Posh above).
  • AzCopy works really well and the command line is easy to implement, but you’ll need to go through the process of setting up all the appropriate connections.
  • PolyBase is wicked cool. Yes, I had to jump through hoops to get the data clean, but note that at no point did I need to recreate my EXTERNAL TABLE in Azure SQL Data Warehouse. I just kept updating the files through my scripts, and as soon as they were loaded up to blob storage, I ran a query against them. No additional work required. That’s nice.
  • You’re going to want to go through the data types on your data before you start the migration.
  • You’re going to have to plan for a lot of time cleaning your data if you haven’t already.

In summary, this process is well documented, but it has a lot of moving parts and lots of places for things to break down. If you’re like me and ETL is a distant memory, plan for a very steep learning curve. If you’re good at ETL but not so hot at automation, your mouse-clicking finger is going to get tired. Microsoft has a decent overview of the process posted here.

Now, I need to get started cleaning the data on the next table… pity me.


Published at DZone with permission of Grant Fritchey, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
