Wrangling Huge .csvs Made Simple (With a Little Help From SQL)

DZone 's Guide to

Wrangling Huge .csvs Made Simple (With a Little Help From SQL)

Large files can cause your machine to crawl and data to get very messy. Here's how TreasureData can help solve such issues.

· Big Data Zone ·
Free Resource

Let’s say you have a very large .csv (containing more than 80k rows) that you want to analyze and visualize.  Usually, given only Excel and Tableau (or Looker, Chartio, etc.), you would have one of two options:

  1. Put it into Excel and run pivot tables, charts, and whatnot, OR
  2. Ingest the data directly into Tableau and try to manipulate it there.

While both tools are excellent choices for their intended use cases, a large enough load would cause Excel to fall down pretty quickly and would similarly become unmanageable to consume directly into a visualization tool like Tableau.

It isn’t scalable to try to ingest this volume of data into a relational database as an intermediary step, either – especially if your data schema isn’t fixed.   That’s why you need a system like Treasure Data that scales automatically for you and provides SQL schema-onread access with schema flexibility and direct connectivity to Tableau.


Let’s look at how to do this.

Note:  These instructions assume that you’ve already set up td toolbelt and have a Treasure Data account.

Preview Your .CSV File

First, you should check the contents of your .csv file. By way of example, I’ll use a fairly large .csv (with >88k rows) containing the NASDAQ index for the last 24 hours. (We can handle much, much bigger files as well!)

$ head -n 5 nasdaq.csv

Ingest Your .CSV into Treasure Data (Using TD Toolbelt)

You’ll need somewhere to ingest the data. Create a database and a table on Treasure Data.

$ td db:create stocks
$ td table:create stocks nasdaq


Then, execute the following commands to upload the .csv file. (NOTE: this will take some time!)

$ td import:auto \
  --format csv --column-header \
  --time-column time \
  --time-format "%Y-%m-%d %H:%M:%S" \
  --auto-create stocks.nasdaq \


Query Your Data into the Format You Want to Visualize

For our visualization, we’ll run a count: using Treasure Data Console, we’ll tally the volume of trades we get by symbol; group the summed volumes by symbol, and sort the symbols in descending order by volume:

  SUM(volume) AS COUNT,

Click the GIF to view the animation:


LIMIT Your Query

That’s a lot of records to try to visualize (2832 to be exact!) so let’s get the top 20 NASDAQ symbols by volume from our dataset.

  SUM(volume) AS COUNT,

Note that the query result is now limited to the top 20 symbols by volume. (Apple is only #3? Really?)

Export Your Data to Tableau

Now that we have that bit working, let’s export our result to a datasource in Tableau. To do this, we’ll fill in the Result Export dialog for Tableau Server.

NOTE: you’ll need a Tableau Server Online account for this step.

For my purposes, I’m using SSL, and Requiring a valid SSL certificate. ‘Datasource’ in this case refers to the one Tableau gets the data from. I’m calling mine ‘nasdaq_demo_john’; for demos, I nearly always use ‘replace’ mode.

Click the GIF to view the animation.


Visualizing Your Data in Tableau

After we log into Tableau Server, we’ll go to home -> data sources and filter for our ‘nasdaq_demo_john’ data source, which we’ll select.

From there, we’ll create a new workbook and drag the following items to build our visualization:

columns: symbol
rows: SUM(count)

Click the GIF to view the animation:


And so…

Want to get the best out of your data analytics infrastructure with the minimum of expense, hassle, and configuration? Then try out Treasure Data.

Our universal data connectors can route almost any data source to almost any data destination. Want an integration we don’t support yet? Ask us!

big data, csv, data, treasure data

Published at DZone with permission of John Hammink , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}