How to Create an Incremental SQL Azure Data Source for U.S. Air Carrier Flight Delays Dataset
Background
My initial U.S. Air Carrier Flight Delays, Monthly dataset for the Windows Azure Marketplace DataMarket was intended to incorporate individual tables for each month of the years 1987 through 2012 (and later). I planned to compare the performance of DataMarket datasets and Windows Azure blob storage as persistent data sources for Apache Hive tables created with the new Apache Hadoop on Windows Azure feature. In early May 2012, I used Microsoft Codename “Data Transfer” to create the first two of these SQL Azure tables, On_Time_Performance_2012_1 and On_Time_Performance_2012_2, from the corresponding Excel On_Time_Performance_2012_1.csv and On_Time_Performance_2012_2.csv files.
Subsequently, I discovered that the Windows Azure Publishing Portal had problems uploading the large (~500,000-row, ~15-MB) On_Time_Performance_YYYY_MM.csv files. Microsoft’s Group Program Manager for the DataMarket advised me that the *.csv upload feature would be disabled to “prevent confusion.” For more information about this issue, see my Microsoft Codename “Data Transfer” and “Data Hub” Previews Don’t Appear Ready for BigData post, updated 5/5/2012.
A further complication was the suspicion that editing the current data source to include each additional table would require a review by a DataMarket proctor. An early edit of one character in a description field had caused my dataset to be offline for a couple of days.
A workaround for the preceding two problems is to create an on-premises clone of the SQL Azure table with a RowId identity column and recreate the SQL Azure table without the identity property on the RowId column. Doing this permits using a BULK INSERT instruction to import new rows from On_Time_Performance_YYYY_MM.csv files into the local SQL Server 2012 table and then using George Huey’s SQL Azure Migration Wizard (SQLAzureMW) v3.8.7 or later to append the new data to a single On_Time_Performance SQL Azure table. Managing the primary key identity values of an on-premises SQL Server table is safer and easier than doing so with SQL Azure.
The downside of this solution is that maintaining access to the 1-GB SQL Azure Web database will require paying US$9.99 per month after your free trial expires. Microsoft provides free SQL Azure storage when you specify a new database in the Windows Azure Marketplace Publishing Portal.
This post describes the process and T-SQL instructions for creating and managing the on-premises SQL Server [Express] 2012 databases, as well as incrementally uploading new data to the SQL Azure database.
Creating a SQL Azure On_Time_Performance Table
If you don’t have a SQL Azure subscription, sign up for the Windows Azure Three-Month Free Trial, which includes a 1-GB SQL Azure Web database, and use the Windows Azure Management Portal to create a SQL Azure server instance and the SQL Azure Management Portal to add a database (named On_Time_Performance for this example).
Tip: Convert your subscription to Pay-Per-Use before the free trial expires, if you want to maintain access to the instance’s database(s).
Following is the T-SQL script generated by SQL Server Management Studio 2012 for the single On_Time_Performance table with the Include Indexes option set to true and NOT NULL specified for all fields:
/****** Object: Table [dbo].[On_Time_Performance] ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[On_Time_Performance](
    [RowId] [bigint] NOT NULL,
    [Year] [int] NOT NULL,
    [Month] [int] NOT NULL,
    [DayofMonth] [int] NOT NULL,
    [FlightDate] [datetime] NOT NULL,
    [Carrier] [nvarchar](256) NOT NULL,
    [Origin] [nvarchar](256) NOT NULL,
    [Dest] [nvarchar](256) NOT NULL,
    [DepDelayMinutes] [int] NOT NULL,
    [ArrDelayMinutes] [int] NOT NULL,
    PRIMARY KEY CLUSTERED ([RowId] ASC)
        WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF)
)
GO
/****** Object: Index [IX_ArrDelayMinutes] ******/
CREATE NONCLUSTERED INDEX [IX_ArrDelayMinutes] ON [dbo].[On_Time_Performance]
    ([ArrDelayMinutes] ASC)
    WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO
SET ANSI_PADDING ON
GO
/****** Object: Index [IX_Carrier] ******/
CREATE NONCLUSTERED INDEX [IX_Carrier] ON [dbo].[On_Time_Performance]
    ([Carrier] ASC)
    WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO
/****** Object: Index [IX_DepDelayMinutes] ******/
CREATE NONCLUSTERED INDEX [IX_DepDelayMinutes] ON [dbo].[On_Time_Performance]
    ([DepDelayMinutes] ASC)
    WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO
SET ANSI_PADDING ON
GO
/****** Object: Index [IX_Dest] ******/
CREATE NONCLUSTERED INDEX [IX_Dest] ON [dbo].[On_Time_Performance]
    ([Dest] ASC)
    WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO
/****** Object: Index [IX_FlightDate] ******/
CREATE NONCLUSTERED INDEX [IX_FlightDate] ON [dbo].[On_Time_Performance]
    ([FlightDate] ASC)
    WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO
/****** Object: Index [IX_Month] ******/
CREATE NONCLUSTERED INDEX [IX_Month] ON [dbo].[On_Time_Performance]
    ([Month] ASC)
    WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO
/****** Object: Index [IX_Origin] ******/
CREATE NONCLUSTERED INDEX [IX_Origin] ON [dbo].[On_Time_Performance]
    ([Origin] ASC)
    WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO
/****** Object: Index [IX_Year] ******/
CREATE NONCLUSTERED INDEX [IX_Year] ON [dbo].[On_Time_Performance]
    ([Year] ASC)
    WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO
Note: SQL Azure requires all tables to have a clustered primary key; the RowId identity column is the same as that of the schema generated when you specify a New Table while appending rows from *.csv files with Codename “Data Transfer” or “Data Hub,” as well as with the DataMarket’s current import-from-*.csv-file feature. Indexes are required on each field of the dataset that you specify as queryable.
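A quick way to confirm that the clustered primary key and nonclustered indexes were created as expected is to query the sys.indexes catalog view (a minimal sketch; sys.indexes is standard SQL Server and SQL Azure metadata):
SELECT name, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.On_Time_Performance')
ORDER BY name
GO
The resultset should list the clustered primary key plus one nonclustered index for each queryable field.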
Open SQL Server Management Studio [Express] 2012, connect to the SQL Azure server (e3895m7bbt.database.windows.net for this example), and provide your administrative UserID (with @ and the server ID appended) and Password:
Click Connect to open the connection, click New Query to add an empty query editor window, copy and paste the preceding T-SQL DDL instructions into the window, and click Execute to create the new table and its indexes:
Creating an On-Premises SQL Server Clone Table
The local clone table needs a RowId primary key with the identity property but doesn’t require indexes. Connect to your local instance of SQL Server [Express] 2012, add a new database (On_Time_Performance for this example), open a new query window, and paste the following code into it:
/****** Object: Table [dbo].[On_Time_Performance] ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[On_Time_Performance](
    [RowId] [bigint] IDENTITY NOT NULL,
    [Year] [int] NOT NULL,
    [Month] [int] NOT NULL,
    [DayofMonth] [int] NOT NULL,
    [FlightDate] [datetime] NOT NULL,
    [Carrier] [nvarchar](256) NOT NULL,
    [Origin] [nvarchar](256) NOT NULL,
    [Dest] [nvarchar](256) NOT NULL,
    [DepDelayMinutes] [int] NOT NULL,
    [ArrDelayMinutes] [int] NOT NULL,
    PRIMARY KEY CLUSTERED ([RowId] ASC)
        WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF)
)
GO
Execute the query and expand the Tables node to verify its schema:
Importing *.csv Data with the BULK INSERT Command
SQL Server’s bulk copy process (bcp) utility is the most common method of migrating data from *.csv files, as well as between SQL Server tables across WANs. However, bcp’s command-line syntax is somewhat arcane, and today’s SQL Server DBAs and developers generally prefer T-SQL DDL and DML statements executed from GUIs.
Following is a T-SQL script that uses a BULK INSERT instruction:
USE [On_Time_Performance]
GO
CREATE TABLE dbo.[FlightDataTemp](
    [Year] [int] NOT NULL,
    [Month] [int] NOT NULL,
    [DayofMonth] [int] NOT NULL,
    [FlightDate] [datetime] NOT NULL,
    [Carrier] [nvarchar](256) NOT NULL,
    [Origin] [nvarchar](256) NOT NULL,
    [Dest] [nvarchar](256) NOT NULL,
    [DepDelayMinutes] [int] NOT NULL,
    [ArrDelayMinutes] [int] NOT NULL)
GO
BULK INSERT dbo.[FlightDataTemp]
FROM 'C:\Users\Administrator\Documents\FlightData\On_Time_Performance_2012_1.csv'
WITH (
    FIELDTERMINATOR = ',',
    FIRSTROW = 2
)
GO
INSERT INTO dbo.[On_Time_Performance]
    ([Year], [Month], [DayofMonth], [FlightDate], [Carrier],
     [Origin], [Dest], [DepDelayMinutes], [ArrDelayMinutes])
SELECT [Year], [Month], [DayofMonth], [FlightDate], [Carrier],
    [Origin], [Dest], [DepDelayMinutes], [ArrDelayMinutes]
FROM dbo.[FlightDataTemp]
GO
DROP TABLE [FlightDataTemp]
GO
Note: The FIRSTROW = 2 parameter skips the first row, which contains column names.
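For reference, the first two lines of an On_Time_Performance_YYYY_MM.csv file would look something like the following; the header names match the FlightDataTemp columns, and the data row is purely hypothetical, shown only to illustrate the layout that BULK INSERT expects:
Year,Month,DayofMonth,FlightDate,Carrier,Origin,Dest,DepDelayMinutes,ArrDelayMinutes
2012,1,1,2012-01-01,AA,JFK,LAX,0,14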
Using an intermediate temporary table eliminates the need for a bcp format file to prevent mapping the RowId identity column to a field in the *.csv file. It also enables inserting rows from several *.csv files and uploading the combined rows in a single SQLAzureMW operation.
Select the local SQL Server instance in Object Explorer, create a new query editor window, copy and paste the preceding T-SQL statements into it, edit the file path and name to suit the location and name of your *.csv file, and click Execute to run the query:
Note: Adding about 500,000 rows locally took about 10 seconds on my test machine.
Verify the number of rows added against Excel’s row count; the rows-affected count should be one less than the number of worksheet rows because the header row is skipped:
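A simple aggregate query (a minimal sketch) also confirms the table’s total row count:
SELECT COUNT(*) AS TotalRows
FROM dbo.On_Time_Performance
GO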
Execute a SELECT TOP(12) * FROM dbo.On_Time_Performance ORDER BY RowId DESC query and compare the resultset with the last 12 rows of the Excel worksheet:
Add the rows from one or more additional *.csv files. Adding rows from On_Time_Performance_2012_2.csv brings the total row count to slightly less than one million rows.
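Only the file name in the BULK INSERT clause changes for each additional month; rerun the full script above (including the temporary table steps) with the new path. For example, assuming the second file is in the same folder as the first:
BULK INSERT dbo.[FlightDataTemp]
FROM 'C:\Users\Administrator\Documents\FlightData\On_Time_Performance_2012_2.csv'
WITH (
    FIELDTERMINATOR = ',',
    FIRSTROW = 2
)
GO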
Uploading Data to the SQL Azure Table with SQLAzureMW
Download and install the latest SQLAzureMW version (v3.8.7 of 4/24/2012 when this post was written), open SQLAzureMW.exe.config in Notepad or Visual Studio, search for ScriptTableAndOrData, and change the value from ScriptOptionsTableSchemaData to ScriptOptionsTableData:
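After the edit, the setting should look similar to the following (a sketch; I’m assuming the key resides in the config file’s standard appSettings section):
<add key="ScriptTableAndOrData" value="ScriptOptionsTableData"/>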
Save your changes, run SQLAzureMW.exe from the C:\Program Files (x86)\SQLAzureMW folder and select the SQL Database option:
Click Next to open the Connect To Server dialog, select the on-premises instance in the Server Name list, accept the default Master DB option:
Click Connect, select the On_Time_Performance database in the list and click Next to open the Choose Objects page. Accept the default Script All Database Objects option and click the Advanced button to confirm the Data Only choice you specified in the Config file:
Click OK and Next to view the Script Wizard Summary, and click Next and Yes when asked if you’re Ready to Generate SQL Script to copy data from the table to a local data (*.dat) file:
Note: Writing the 950,959 rows of the table took about 4.2 seconds.
The full bcp.exe command is:
bcp.exe "[On_Time_Performance].[dbo].[On_Time_Performance]" out "c:\SQLAzureMW\BCPData\dbo.On_Time_Performance.dat" -E -n -T -S OL-WIN7PRO23\SQLEXPRESS
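In this export command, -n writes the data in SQL Server’s native format, -T uses a trusted (Windows authentication) connection, -S names the source server and instance, and -E preserves identity values (it actually takes effect during the subsequent import).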
Click Next to open the Connect to Server dialog for the SQL Azure database, type the server name, your User Name with an @servername suffix and Password, accept the default Master DB option:
Click Connect to open the Set Up Target Server Connection page, select On_Time_Performance, click Next, and click Yes when asked if you want to Execute Script Against Destination Server to start the data upload process. Click the On_Time_Performance tab to display upload progress:
The full bcp command for uploading the data in 10,000-row batches is:
bcp.exe "On_Time_Performance.dbo.On_Time_Performance" in "c:\SQLAzureMW\BCPData\dbo.On_Time_Performance.dat" -E -n -b 10000 -a 16384 -q -S e3895m7bbt.database.windows.net -U "RogerJ@e3895m7bbt" -P xxxxxx
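Here, -b 10000 commits the upload in 10,000-row batches so a dropped connection doesn’t force a complete restart, -a 16384 increases the network packet size to 16 KB, -q sets QUOTED_IDENTIFIERS ON for the connection, and -U/-P supply the SQL Azure login (with the @servername suffix) and its password.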
Here’s the final upload report:
… To be continued later tonight or tomorrow morning.