Wrangling Huge .csvs Made Simple (With a Little Help From SQL)

Large files can cause your machine to crawl and your data to get very messy. Here's how Treasure Data can help solve such issues.

Let’s say you have a very large .csv (more than 80k rows) that you want to analyze and visualize. Given only Excel and Tableau (or Looker, Chartio, etc.), you would usually have one of two options:

  1. Put it into Excel and run pivot tables, charts, and whatnot, OR
  2. Ingest the data directly into Tableau and try to manipulate it there.

While both tools are excellent choices for their intended use cases, a large enough load will bring Excel down pretty quickly, and data at that volume similarly becomes unmanageable when ingested directly into a visualization tool like Tableau.

It isn’t scalable to ingest this volume of data into a relational database as an intermediary step, either, especially if your data schema isn’t fixed. That’s why you need a system like Treasure Data: one that scales automatically for you and provides schema-on-read SQL access, schema flexibility, and direct connectivity to Tableau.

Data Munging

Let’s look at how to do this.

Note: These instructions assume that you’ve already set up td toolbelt and have a Treasure Data account.

Preview Your .CSV File

First, you should check the contents of your .csv file. By way of example, I’ll use a fairly large .csv (with >88k rows) containing the NASDAQ index for the last 24 hours. (We can handle much, much bigger files as well!)

$ head -n 5 nasdaq.csv
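
As an optional sanity check before uploading, a couple of standard Unix commands will confirm the row count and the column names (run here against the same nasdaq.csv):

# Count the rows, including the header line
$ wc -l nasdaq.csv

# Print just the header row to see the column names
$ head -n 1 nasdaq.csv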

Ingest Your .CSV into Treasure Data (Using TD Toolbelt)

You’ll need somewhere to ingest the data. Create a database and a table on Treasure Data.

$ td db:create stocks
$ td table:create stocks nasdaq

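If you’d like to confirm that both were created before importing, td toolbelt also provides list commands; a quick check looks like this:

# List the databases in your account; 'stocks' should appear
$ td db:list

# List the tables in the stocks database; 'nasdaq' should appear
$ td table:list stocks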

Then, execute the following command to upload the .csv file. (NOTE: this will take some time!)

$ td import:auto \
  --format csv --column-header \
  --time-column time \
  --time-format "%Y-%m-%d %H:%M:%S" \
  --auto-create stocks.nasdaq \
  ./nasdaq.csv

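Once the import job completes, one quick way to verify that every row landed is a count from the command line. This is a sketch using td toolbelt's query subcommand, where -d selects the database and -w waits for the job to finish:

# Count the imported rows; this should match the row count of the source .csv
$ td query -w -d stocks "SELECT COUNT(1) FROM nasdaq"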

Query Your Data into the Format You Want to Visualize

For our visualization, we’ll use the Treasure Data Console to sum the trade volume for each symbol, grouping by symbol and sorting the symbols in descending order by total volume:

SELECT 
  SUM(volume) AS COUNT,
  symbol
FROM
  nasdaq
GROUP BY
  symbol
ORDER BY
  COUNT DESC

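The console isn’t the only way to run this, by the way. If you prefer the terminal, the same SQL can be submitted through td toolbelt (a sketch, reusing the stocks database from earlier):

# Run the same aggregation from the command line and wait for the result
$ td query -w -d stocks \
  "SELECT SUM(volume) AS COUNT, symbol FROM nasdaq GROUP BY symbol ORDER BY COUNT DESC"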

LIMIT Your Query

That’s a lot of records to try to visualize (2,832 to be exact!), so let’s get just the top 20 NASDAQ symbols by volume from our dataset.

SELECT 
  SUM(volume) AS COUNT,
  symbol
FROM
  nasdaq
GROUP BY
  symbol
ORDER BY
  COUNT DESC
LIMIT 20

Note that the query result is now limited to the top 20 symbols by volume. (Apple is only #3? Really?)

Export Your Data to Tableau

Now that we have that bit working, let’s export our result to a data source in Tableau. To do this, we’ll fill in the Result Export dialog for Tableau Server.

NOTE: you’ll need a Tableau Server or Tableau Online account for this step.

For my purposes, I’m using SSL and requiring a valid SSL certificate. ‘Datasource’ in this case refers to the Tableau data source that will receive the results. I’m calling mine ‘nasdaq_demo_john’; for demos, I nearly always use ‘replace’ mode.


Visualize Your Data in Tableau

After we log into Tableau Server, we’ll go to home -> data sources and filter for our ‘nasdaq_demo_john’ data source, which we’ll select.

From there, we’ll create a new workbook and drag the following items to build our visualization:

Columns: symbol
Rows: SUM(count)


And so…

Want to get the best out of your data analytics infrastructure with the minimum of expense, hassle, and configuration? Then try out Treasure Data.

Our universal data connectors can route almost any data source to almost any data destination. Want an integration we don’t support yet? Ask us!

Topics:
data, big data, csv, treasure data

Published at DZone with permission of John Hammink. See the original article here.
