Wrangling Huge .csvs Made Simple (With a Little Help From SQL)
Large files can slow your machine to a crawl and quickly get messy. Here's how Treasure Data can help solve such issues.
Let’s say you have a very large .csv (containing more than 80k rows) that you want to analyze and visualize. Usually, given only Excel and Tableau (or Looker, Chartio, etc.), you would have one of two options:
- Put it into Excel and run pivot tables, charts, and whatnot, OR
- Ingest the data directly into Tableau and try to manipulate it there.
While both tools are excellent choices for their intended use cases, a large enough load will bring Excel to its knees pretty quickly, and ingesting that much raw data directly into a visualization tool like Tableau becomes just as unmanageable.
It isn’t scalable to ingest this volume of data into a relational database as an intermediary step, either – especially if your data schema isn’t fixed. That’s why you need a system like Treasure Data that scales automatically for you and provides flexible, schema-on-read SQL access with direct connectivity to Tableau.
Let’s look at how to do this.
Preview Your .CSV File
First, you should check the contents of your .csv file. By way of example, I’ll use a fairly large .csv (with >88k rows) containing the NASDAQ index for the last 24 hours. (We can handle much, much bigger files as well!)
$ head -n 5 nasdaq.csv
Ingest Your .CSV into Treasure Data (Using TD Toolbelt)
You’ll need somewhere to ingest the data. Create a database and a table on Treasure Data.
$ td db:create stocks
$ td table:create stocks nasdaq
Then, execute the following commands to upload the .csv file. (NOTE: this will take some time!)
$ td import:auto \
    --format csv --column-header \
    --time-column time \
    --time-format "%Y-%m-%d %H:%M:%S" \
    --auto-create stocks.nasdaq \
    ./nasdaq.csv
Query Your Data into the Format You Want to Visualize
For our visualization, we’ll use the Treasure Data Console to sum the trade volume per symbol, grouping by symbol and sorting in descending order of total volume:
SELECT SUM(volume) AS count, symbol
FROM nasdaq
GROUP BY symbol
ORDER BY count DESC
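If you want to sanity-check locally what this query computes before running it on your full dataset, SQLite (standing in here for Treasure Data's query engine) handles the same SQL shape. The trade rows below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nasdaq (symbol TEXT, volume INTEGER)")

# Made-up trades; each symbol appears many times, as in the raw feed.
conn.executemany(
    "INSERT INTO nasdaq (symbol, volume) VALUES (?, ?)",
    [("AAPL", 1200), ("MSFT", 800), ("AAPL", 950), ("GOOG", 3000)],
)

# Same shape as the query above: total volume per symbol, descending.
rows = conn.execute(
    "SELECT SUM(volume) AS count, symbol FROM nasdaq "
    "GROUP BY symbol ORDER BY count DESC"
).fetchall()

for total, symbol in rows:
    print(symbol, total)
```

Each symbol's rows collapse into a single summed total, ordered largest first, which is exactly the list we're about to trim with `LIMIT` and chart in Tableau.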
Click the GIF to view the animation:
LIMIT Your Query
That’s a lot of records to try to visualize (2,832 symbols, to be exact!), so let’s pull just the top 20 NASDAQ symbols by volume from our dataset.
SELECT SUM(volume) AS count, symbol
FROM nasdaq
GROUP BY symbol
ORDER BY count DESC
LIMIT 20
Note that the query result is now limited to the top 20 symbols by volume. (Apple is only #3? Really?)
Export Your Data to Tableau
Now that we have that bit working, let’s export our result to a datasource in Tableau. To do this, we’ll fill in the Result Export dialog for Tableau Server.
NOTE: you’ll need a Tableau Online (hosted Tableau Server) account for this step.
For my purposes, I’m using SSL and requiring a valid SSL certificate. ‘Datasource’ in this case refers to the Tableau datasource the query results will be written to. I’m calling mine ‘nasdaq_demo_john’; for demos, I nearly always use ‘replace’ mode.
Click the GIF to view the animation.
Visualizing Your Data in Tableau
After we log into Tableau Server, we’ll go to home -> data sources and filter for our ‘nasdaq_demo_john’ data source, which we’ll select.
From there, we’ll create a new workbook and drag the following items to build our visualization:
- Columns: symbol
- Rows: SUM(count)
Click the GIF to view the animation:
Want to get the best out of your data analytics infrastructure with the minimum of expense, hassle, and configuration? Then try out Treasure Data.
Our universal data connectors can route almost any data source to almost any data destination. Want an integration we don’t support yet? Ask us!
Published at DZone with permission of John Hammink, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.