DZone
Big Data Zone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
  • Refcardz
  • Trend Reports
  • Webinars
  • Zones
  • |
    • Agile
    • AI
    • Big Data
    • Cloud
    • Database
    • DevOps
    • Integration
    • IoT
    • Java
    • Microservices
    • Open Source
    • Performance
    • Security
    • Web Dev
DZone > Big Data Zone > Wrangling Huge .csvs Made Simple (With a Little Help From SQL)

Wrangling Huge .csvs Made Simple (With a Little Help From SQL)

Large files can cause your machine to crawl and data to get very messy. Here's how TreasureData can help solve such issues.

John Hammink user avatar by
John Hammink
·
Apr. 11, 16 · Big Data Zone · Tutorial
Like (6)
Save
Tweet
3.00K Views

Join the DZone community and get the full member experience.

Join For Free

Let’s say you have a very large .csv (containing more than 80k rows) that you want to analyze and visualize.  Usually, given only Excel and Tableau (or Looker, Chartio, etc.), you would have one of two options:

  1. Put it into Excel and run pivot tables, charts, and whatnot, OR
  2. Ingest the data directly into Tableau and try to manipulate it there.

While both tools are excellent choices for their intended use cases, a large enough load would cause Excel to fall down pretty quickly and would similarly become unmanageable to consume directly into a visualization tool like Tableau.

It isn’t scalable to try to ingest this volume of data into a relational database as an intermediary step, either – especially if your data schema isn’t fixed.   That’s why you need a system like Treasure Data that scales automatically for you and provides SQL schema-on–read access with schema flexibility and direct connectivity to Tableau.

Data-Munging

Let’s look at how to do this.

Note:  These instructions assume that you’ve already set up td toolbelt and have a Treasure Data account.

Preview Your .CSV File

First, you should check the contents of your .csv file. By way of example, I’ll use a fairly large .csv (with >88k rows) containing the NASDAQ index for the last 24 hours. (We can handle much, much bigger files as well!)

$ head -n 5 nasdaq.csv
head_nasdaq_csv

Ingest Your .CSV into Treasure Data (Using TD Toolbelt)

You’ll need somewhere to ingest the data. Create a database and a table on Treasure Data.

$ td db:create stocks
$ td table:create stocks nasdaq

create_db_create_table

Then, execute the following commands to upload the .csv file. (NOTE: this will take some time!)

$ td import:auto \
  --format csv --column-header \
  --time-column time \
  --time-format "%Y-%m-%d %H:%M:%S" \
  --auto-create stocks.nasdaq \
  ./nasdaq.csv

upload_csv

Query Your Data into the Format You Want to Visualize

For our visualization, we’ll run a count: using Treasure Data Console, we’ll tally the volume of trades we get by symbol; group the summed volumes by symbol, and sort the symbols in descending order by volume:

SELECT 
  SUM(volume) AS COUNT,
  symbol
FROM
  nasdaq
GROUP BY
  symbol
ORDER BY
  COUNT DESC

Click the GIF to view the animation:

visualize_query_1

LIMIT Your Query

That’s a lot of records to try to visualize (2832 to be exact!) so let’s get the top 20 NASDAQ symbols by volume from our dataset.

SELECT 
  SUM(volume) AS COUNT,
  symbol
FROM
  nasdaq
GROUP BY
  symbol
ORDER BY
  COUNT DESC LIMIT 20

Note that the query result is now limited to the top 20 symbols by volume. (Apple is only #3? Really?)

Export Your Data to Tableau

Now that we have that bit working, let’s export our result to a datasource in Tableau. To do this, we’ll fill in the Result Export dialog for Tableau Server.

NOTE: you’ll need a Tableau Server Online account for this step.

For my purposes, I’m using SSL, and Requiring a valid SSL certificate. ‘Datasource’ in this case refers to the one Tableau gets the data from. I’m calling mine ‘nasdaq_demo_john’; for demos, I nearly always use ‘replace’ mode.

Click the GIF to view the animation.

result_export_tableau

Visualizing Your Data in Tableau

After we log into Tableau Server, we’ll go to home -> data sources and filter for our ‘nasdaq_demo_john’ data source, which we’ll select.

From there, we’ll create a new workbook and drag the following items to build our visualization:

columns: symbol
rows: SUM(count)

Click the GIF to view the animation:

viz_data_tableau

And so…

Want to get the best out of your data analytics infrastructure with the minimum of expense, hassle, and configuration? Then try out Treasure Data.

Our universal data connectors can route almost any data source to almost any data destination. Want an integration we don’t support yet? Ask us!

Database Relational database Data (computing) sql

Published at DZone with permission of John Hammink, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Back to Basics: Accessing Kubernetes Pods
  • Everything You Need to Know About Web Pentesting: A Complete Guide
  • Sprint Goals: How to Write, Manage, and Achieve
  • 10 Best Infrastructure-as-Code Tools for Automating Deployments in 2022

Comments

Big Data Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • MVB Program
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends:

DZone.com is powered by 

AnswerHub logo