Microsoft Codename “Data Transfer” and “Data Hub” May Not Be Ready for Big Data
Update 4/23/2012 7:30 AM PDT: Two members of the Microsoft Data Hub/Transfer team reported that they can upload the large test file successfully. I added the “Computer/Internet Connection Details” section below. I am proceeding with tests to find the maximum file size I can upload, starting with a logarithmic series of row counts (1,000, 10,000, and 100,000), to determine whether it’s a timeout problem.
Or even MediumData, for that matter.
Neither the Codename “Data Transfer” utility nor Codename “Data Hub” application CTPs would load 500,000 rows of a simple Excel worksheet saved as a 17-MB *.csv file to an SQL Azure table.
The “Data Transfer” utility’s Choose a File to Transfer page states: “We support uploading of files smaller than 200 MB, right now,” but neither preview publishes a row count limit that I can find. “Data Hub” uses “Data Transfer” to upload data, so the maximum file size would apply to it, too.
Both Windows Azure SaaS offerings handled a 100-row subset with aplomb, so the issue appears to be row count, not data structure.
The Creating the Azure Blob Source Data section of my Using Data from Windows Azure Blobs with Apache Hadoop on Windows Azure CTP post of 4/6/2012 described the data set I wanted to distribute via a publicly accessible, free Windows Azure DataMarket dataset. The only differences between it and the tab-delimited *.txt files uploaded to blobs that served as the data source for an Apache Hive table were:
- Inclusion of column names in the first row
- Addition of a formatted date field (Hive tables don’t have a native date or datetime datatype)
- Field delimiter character (comma instead of tab)
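The three differences listed above can be sketched as a small conversion script. This is a minimal illustration, not the procedure I actually used (the *.csv files were saved from Excel). The function name `tsv_to_csv`, the `FlightDate` column name, the ISO date format, and UTF-8 encoding are all assumptions, and the caller must supply the real column names and the positions of the year, month, and day fields.

```python
import csv
from datetime import date


def tsv_to_csv(src_path, dst_path, column_names, ymd_indexes):
    """Convert a headerless tab-delimited file to a comma-delimited file.

    Adds a header row and appends a formatted date column built from the
    year, month, and day fields at the positions given in ymd_indexes.
    Returns the number of data rows written.
    """
    y_idx, m_idx, d_idx = ymd_indexes
    rows_written = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)  # comma is the default delimiter
        # "FlightDate" is a hypothetical name for the added date column.
        writer.writerow(list(column_names) + ["FlightDate"])
        for row in reader:
            if not row:  # skip blank lines
                continue
            flight_date = date(int(row[y_idx]), int(row[m_idx]), int(row[d_idx]))
            writer.writerow(row + [flight_date.isoformat()])
            rows_written += 1
    return rows_written
```

A call such as `tsv_to_csv("in.txt", "out.csv", names, (0, 1, 2))` would assume the first three fields hold the year, month, and day.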
Following is a screen capture of the first 20 data rows of the 500,000-row On_Time_Performance_2012_1.csv table:
You can download sample On_Time_Performance_YYYY_MM.csv files from the OnTimePerformanceCSV folder of my Windows Live SkyDrive account. On_Time_Performance_2012_0.csv is the 100-row sample file described in the preceding section; On_Time_Performance_2012_1.csv has 486,133 data rows.
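For anyone who wants to repeat the upload tests with intermediate sizes, the following sketch cuts a large *.csv file into subsets of 1,000, 10,000, and 100,000 data rows, each with the header row preserved. The function name `write_subsets`, the output file naming, and UTF-8 encoding are assumptions made for illustration.

```python
import csv
from itertools import islice
from pathlib import Path


def write_subsets(src_path, row_counts=(1_000, 10_000, 100_000)):
    """Write one CSV per row count: the header plus the first N data rows.

    Returns a dict mapping each output path to the number of data rows
    actually written (fewer than N if the source file is shorter).
    """
    src = Path(src_path)
    with src.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Read only as many rows as the largest subset needs.
        rows = list(islice(reader, max(row_counts)))

    written = {}
    for n in row_counts:
        out = src.with_name(f"{src.stem}_{n}_rows.csv")
        with out.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows[:n])
        written[out] = min(n, len(rows))
    return written
```

Calling `write_subsets("On_Time_Performance_2012_1.csv")` would write three files next to the source file, one per row count.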
Tab-delimited sample On_Time_Performance_YYYY_MM.txt files (without the first row of column names and formatted date) for use in creating blobs to serve as the data source for Hive databases are available from my Flight Data Files for Hadoop on Azure SkyDrive folder.
Provision of the files through a private Azure DataMarket service was intended to supplement the SkyDrive downloads.
Computer/Internet Connection Details:
Intel 64-bit DQ45CB motherboard with Core 2 Quad CPU Q9550 2.83 GHz, 8 GB RAM, 750 GB RAID 1 disks, Windows 7 Premium SP1, IE 9.0.8112.16421.
AT&T commercial DSL copper connection, Cayman router, 2.60 Mbps download, 0.42 Mbps upload after router reboot, 100-Mbps wired connection from Windows 2003 Server R&RA NAT.
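A rough, best-case estimate of how long the 17-MB file takes to cross this connection helps frame the timeout question. The sketch below assumes 1 MB = 1,000,000 bytes, a sustained 0.42 Mbps upload rate, and no protocol overhead, so real transfers would take somewhat longer. The function name `upload_seconds` is illustrative.

```python
def upload_seconds(file_megabytes, upload_mbps):
    """Best-case transfer time in seconds, ignoring protocol overhead."""
    bits = file_megabytes * 1_000_000 * 8      # assumes 1 MB = 10^6 bytes
    return bits / (upload_mbps * 1_000_000)    # Mbps = 10^6 bits per second


seconds = upload_seconds(17, 0.42)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")  # prints "324 s (~5.4 min)"
```

At more than five minutes for the transfer alone, a server-side timeout is at least plausible on a connection this slow, although the estimate does not prove that a timeout is the cause.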
Codename “Data Hub” provides visitors with up to four free SQL Azure 1-GB Web databases, so I created a connection to a new On_Time_Performance database:
I then specified the ~500,000-row On_Time_Performance_2012_1.csv file for January 2012 as the data source and clicked Upload:
The site provided no indication of any activity, although my DSL router indicated data was being uploaded. After a few minutes, the server disconnected. Reloading the page showed no change in status.
I then tried uploading the 100-row On_Time_Performance_2012_0.csv, which opened the following page after about 10 seconds:
I accepted the suggested data types and clicked Submit, which added the data to the table.
I created a new database in an existing OakLeaf SQL Azure instance because “Data Transfer” doesn’t provide free 1-GB Web databases. I repeated the above process with Codename “Data Transfer” but encountered a bug that prevented use of the # character (and presumably other symbols) in the existing database access password:
Selecting On_Time_Performance_2012_1.csv to upload by clicking Analyze caused the app to hang in the Loading … condition:
Canceling the process and selecting 100-row On_Time_Performance_2012_0.csv resulted in the expected Update the Table Settings page appearing in about 10 seconds:
Clicking Save resulted in a Submit Succeeded message.
Neither Codename “Data Hub” nor Codename “Data Transfer” appears to be ready for prime time. Hopefully, a fast refresh will solve the problem because users’ Codename “Data Hub” preview invitations are valid only for three weeks.