
Gobbling Up Big(ish) Data for Lunch Using BigQuery


Google BigQuery is a good fit when a local MySQL instance can only handle a small slice of your data. The best part? It has incredible speed.


Beers + WSPR = Fun

To this day, I’m a firm believer in the benefits of simple, informative, and spontaneous conversations with my colleagues — at least with the ones who can stand me long enough to chat with me. Chewing the fat with other like-minded folks over a beer or two is a bloody good thing. It’s how ideas are born, knowledge is shared, and relationships are formed. It’s an important aspect of any business that is sadly all too often overlooked.

No, that’s not us at the gathering.

At a recent gathering of Shiners (yes, there were beers involved), I got chatting with fellow Shiner Aaron Brown. We were nonchalantly discussing the client projects that we were currently working on (and bad-mouthing our manager Luke like we always do), when the topic turned to our side projects.

Aaron began to tell me about his keen interest in the Weak Signal Propagation Reporter (WSPR) network and how he was looking for a good and fast way to analyze all the public data that they had available for download on the website. 

He had some cool ideas for analyzing the data, but so far he had only been able to analyze a small part of it using a local MySQL instance. I was quick to tell him that we’d been really impressed by Google BigQuery on our projects at Telstra, and that I thought it might be just the ticket. It would easily be able to handle the amount of data he was looking at — and with $300 of free credit on Google Cloud Platform, Aaron would be able to analyze the WSPR data to his heart’s content (Google charges $5 per TB queried in BigQuery).

We agreed that when we both had some spare time, we’d give it a shot and try to load it into BigQuery so he could begin analyzing it. The data Aaron wanted to look at was about eight years' worth — 500M rows and roughly 60GB. That's not a huge amount by any stretch of the imagination, but it's enough to warrant a BigQuery solution, in my opinion. (At $5 per TB, a full scan of a 60GB table costs roughly 30 cents, so that free credit goes a long way.) It sounded like a good plan, and more importantly, it sounded like great fun.

One hangover — eh, I mean day — later, I was fumbling around on my machine trying to cause some trouble during my lunch break, when I thought about doing the WSPR load for Aaron. It should take only an hour, I thought.

I was wrong. It took less than 45 minutes.

Entrée: Spin Up a Google Compute Engine Instance

Duration: About 30 seconds.

Actually, I didn’t even need to spin up a GCE instance (VM). The Google Cloud console has a really cool feature called Cloud Shell. It gives you command-line access to your own personal GCE instance under the hood, directly in the browser. It even comes with the gcloud tools and components out of the box. Awesome.
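If you're following along, the only prerequisite for the upload step later on is a Cloud Storage bucket to land the files in. Something along these lines from Cloud Shell does the trick; the wsprnet name matches the script below, and since bucket names are globally unique you'd need to pick your own:

gcloud config list        # confirm the active account and project
gsutil version            # gsutil ships pre-installed in Cloud Shell
gsutil mb gs://wsprnet/   # one-off: create the bucket the download script uploads to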


Mains: Eight Lines of Bash

Duration: ~30 minutes.

The WSPR download page clearly lists all the files available for download, and the archive goes back to March 2008. So, a simple bash script with a loop and wget was all I needed to fetch the files. Once they were downloaded, a single call to gsutil uploaded them to a bucket (wsprnet) in Google Cloud Storage — and that was that.

#!/usr/bin/env bash
# Fetch every monthly WSPR archive from March 2008 onwards, then upload the lot to GCS.
START=2008-03
while [ "$START" != 2017-02 ]; do
  echo "$START"
  wget -P /etc/wsprnet/ "http://wsprnet.org/archive/wsprspots-${START}.csv.gz"
  START=$(date -d "${START}-01 +1 month" +%Y-%m)  # advance one month (GNU date, as in Cloud Shell)
done
gsutil -m cp /etc/wsprnet/* gs://wsprnet/

Dessert: Load Data From GCS Into BigQuery

Duration: ~10 minutes.

Once the bash script had finished, I had all the data sitting nice and pretty in GCS. I had downloaded the compressed (gzip) versions of the files to keep things quick, and I knew from some previous work I’d done with BigQuery that it supports loading gzipped files straight from GCS. Too easy.


All I needed to do next was load those files into a table in BigQuery. So, I flicked over to the BigQuery web UI and configured a simple load job pointing at the gs://wsprnet bucket.

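If you'd rather stay on the command line, roughly the same load can be kicked off from Cloud Shell with the bq tool. Treat this as a sketch only: the dataset and table names are made up, and the column list is a guess at the wsprspots CSV layout rather than the exact configuration from the UI.

# Create a dataset (name is just an example) and load every gzipped monthly CSV in one job.
bq mk wspr
bq load \
  --source_format=CSV \
  wspr.spots \
  "gs://wsprnet/wsprspots-*.csv.gz" \
  spot_id:INTEGER,timestamp:INTEGER,reporter:STRING,reporter_grid:STRING,snr:INTEGER,frequency:FLOAT,call_sign:STRING,grid:STRING,power:INTEGER,drift:INTEGER,distance:INTEGER,azimuth:INTEGER,band:INTEGER,version:STRING,code:INTEGER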

That took just 10 minutes to load 500M rows and about 60GB of data. Nice. The final result of that load job was all the WSPR data now waiting to be analyzed by Aaron!


Digestive

Duration: ~1 minute.

Tools like BigQuery are commoditizing big data. It’s so easy and quick to get up and running nowadays, no matter the size of the data you want to analyze. Who wants to be mucking around with Hadoop or Hive clusters or on-prem database instances when you can use tools like these? Seriously.

Stay tuned for part two of this post, when Aaron will tell us all about how his analysis went using BigQuery. Oh, and by the way, we’ve made the table in BigQuery public, so anyone can access/analyze it. You can find it here.
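If you want a first taste yourself, a query along these lines works from Cloud Shell as well. Swap in the real table path from the link above; the column names here follow the guessed schema from the load sketch earlier.

# Spot counts per band. The table path below is a placeholder, not the public table's real name.
bq query --use_legacy_sql=false '
  SELECT band, COUNT(*) AS spots
  FROM `your-project.wspr.spots`
  GROUP BY band
  ORDER BY spots DESC'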



Published at DZone with permission of Graham Polley, DZone MVB. See the original article here.
