
Importing a Twitter Archive into SQL Server 2014




This article looks at importing an archive of Twitter data into SQL Server 2014. The same process applies to older versions of SQL Server.

Obtaining an NSA™ Approved Twitter Archive

Log in to Twitter using the web interface (i.e. navigate to http://www.twitter.com) with your account credentials, then click the “cog” icon on the top right-hand side and select “Settings.”


Click on “Account” in the left-hand side menu and then scroll to the very bottom:

Request an archive of your tweets

You’ll be sent an e-mail at your registered address when the archive is ready to be downloaded. My main Twitter account has nearly 10,000 tweets and was established in January 2007, yet the archive was ready almost instantly. By the time I checked my e-mail account, the link was there.

An e-mail arrives

Reviewing the Archive

The archive is a zip file and contains some interesting content.


The simplest way to access the archive data is to load the supplied index.html into any standard web browser.


The offline archive is completely searchable and also contains some interesting account information: you can view your User ID (mine’s in the 600,000 range) as well as the date you joined (for me, January 11th, 2007, in a beautiful DB-compatible format).

But Wait ... There's More

If you examine the archive folder contents more closely, you’ll also discover something extremely handy: a file called tweets.csv. In my case, it is 1.7 MB in size. Opening the .csv file, I can discern the following column headings:

Column Name                 | Potential Usage                                | Data Type
tweet_id                    | Unique identifier of a tweet?                  | Numeric
in_reply_to_status_id       | In the context of someone else’s tweet        | Numeric
in_reply_to_user_id         | Reply to a specific user                       | Integer
timestamp                   | Date/time of the tweet                         | Date
source                      | Origin of the tweet (e.g. web, mobile device)  | Unicode string
text                        | Text of the tweet                              | Unicode string
retweeted_status_id         | Original tweet (if re-tweeted)                 | Numeric
retweeted_status_user_id    | User of the original tweet                     | Integer
retweeted_status_timestamp  | Original date/time of the re-tweeted tweet     | Date
expanded_urls               | Original URI (before URL minification)         | Unicode string

Now, for test data purposes, this is a little light: the data comes in at just over 9,800 rows, but it should be useful enough for the time being. Remember that Unicode characters are double-byte, so I gave the “text” column a length of 300 (140 * 2 = 280) to allow some extra space and be on the safe side.

Importing the Data into SQL Server

First, I’ll copy the .csv file onto my test VM. The rest of this article more or less walks through how to use the SQL Server Import and Export Wizard properly.


The first thing to do in the wizard is change the Data Source to “Flat File Source” and then browse for the .csv file. Once you’ve done that, you can verify the data by clicking the Columns option in the left-hand navigation. Make sure you set the “Text qualifier” field in the “General” options to a quotation mark ("), or you’ll get data conversion issues.
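
To see why the qualifier matters, here is roughly what a row of tweets.csv looks like (an illustrative, made-up row; fields follow the heading order above, and the text field can contain commas):

    "123456789012345678","","","2014-03-01 10:15:22 +0000","web","Hello, world. Commas inside the quoted text field are why the qualifier matters.","","","","http://example.com/full-article"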


Next, click on the “Advanced” option to rename the columns and set data types and sizes. I applied the data types listed in the table below, and also set some column lengths/precisions. This worked for me, but your experience may differ. If you’re at all unsure, use the “Suggest Types…” wizard to help with selections.

Column Name                 | Data Type                 | SQL Type   | Length/Precision | Nullable?
tweet_id                    | numeric                   | DT_NUMERIC | 20               | No
in_reply_to_status_id       | numeric                   | DT_NUMERIC | 20               | Yes
in_reply_to_user_id         | four-byte signed integer  | DT_I4      | n/a              | Yes
tweet_timestamp             | date                      | DT_DATE    | n/a              | No
source                      | Unicode string            | DT_WSTR    | 200              | No
tweet_text                  | Unicode string            | DT_WSTR    | 300              | No
retweeted_status_id         | numeric                   | DT_NUMERIC | 20               | Yes
retweeted_status_user_id    | four-byte signed integer  | DT_I4      | n/a              | Yes
retweeted_status_timestamp  | date                      | DT_DATE    | n/a              | Yes
expanded_urls               | Unicode string            | DT_WSTR    | 350              | Yes

NOTE: I renamed the timestamp and text columns from the original .csv heading names (to tweet_timestamp and tweet_text) to avoid using SQL types as column names.
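
If you’d rather create the destination table yourself instead of letting the wizard generate it, a T-SQL definition matching these mappings might look like the following sketch (DT_DATE typically maps to datetime, and the wizard-generated table may differ slightly):

    CREATE TABLE dbo.tweets (
        tweet_id                   NUMERIC(20, 0) NOT NULL,  -- unique identifier of the tweet
        in_reply_to_status_id      NUMERIC(20, 0) NULL,
        in_reply_to_user_id        INT            NULL,
        tweet_timestamp            DATETIME       NOT NULL,  -- date/time of the tweet
        source                     NVARCHAR(200)  NOT NULL,  -- origin, e.g. web or a mobile client
        tweet_text                 NVARCHAR(300)  NOT NULL,  -- 140 characters plus headroom
        retweeted_status_id        NUMERIC(20, 0) NULL,
        retweeted_status_user_id   INT            NULL,
        retweeted_status_timestamp DATETIME       NULL,
        expanded_urls              NVARCHAR(350)  NULL
    );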


Select “Preview” to do a quick sanity check. Once satisfied, click Next to set the destination options. For these articles, I’ve created a new database called ‘PreviewTest.’ When you get to the “Source Tables” screen, open the “Edit Mappings” dialog and check that everything’s set correctly.



When ready, execute the import as a new SSIS package. Getting all the data types right, both for the input definition and the target destination, took me about half an hour.


If all goes according to plan, you should have rows inserted into a brand new table.


After a successful import, I can now query the [tweets] table and see all my glorious Twitter contributions:
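
For example, a simple query (hypothetical, using the column names chosen above):

    SELECT TOP (10) tweet_timestamp, source, tweet_text
    FROM dbo.tweets
    ORDER BY tweet_timestamp DESC;  -- the ten most recent tweets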


Post-Import Tidy Up

Let’s do some clean-up. Apply the following changes to the table schema (a T-SQL sketch follows the list):

  • Create a non-clustered PK on [tweet_id]
  • Create a clustered index on [tweet_timestamp]
  • Create a non-clustered index on [tweet_text]
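
In T-SQL, those changes might look like this (a sketch, assuming the table and column names used in this walkthrough):

    -- Non-clustered primary key on the tweet identifier.
    ALTER TABLE dbo.tweets
        ADD CONSTRAINT PK_tweets_tweet_id PRIMARY KEY NONCLUSTERED (tweet_id);

    -- Cluster the table on the tweet date/time.
    CREATE CLUSTERED INDEX IX_tweets_tweet_timestamp
        ON dbo.tweets (tweet_timestamp);

    -- NVARCHAR(300) is 600 bytes, within SQL Server 2014's 900-byte index key limit.
    CREATE NONCLUSTERED INDEX IX_tweets_tweet_text
        ON dbo.tweets (tweet_text);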


Query with Execution Plan
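
To reproduce something like the screenshot, enable “Include Actual Execution Plan” in Management Studio (Ctrl+M) and run a query; for example (illustrative):

    -- With the actual execution plan enabled, SSMS shows the plan alongside the results.
    SELECT tweet_text
    FROM dbo.tweets
    WHERE tweet_timestamp >= '2014-01-01';  -- a range seek on the clustered index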


This was a fairly painful exercise for me, as it turns out I’m rather rusty with the data import wizard. For those of you who are also rusty, here are some common (and perhaps, in hindsight, obvious) tips:

  • If your data is text qualified, make sure you specify the qualifying value.
  • If you use Unicode types, make sure you specify a double-byte length.
  • Remember to check column names to avoid using SQL or .NET types as names.
  • If something doesn’t look right in the Preview pane, your import isn’t likely to run successfully.

Check back for my next installment, where I’ll establish a new schema and start to look at what we can do with SQL Server 2014 and the latest version of the Entity Framework.



