A couple of months ago, my colleague Graham Polley wrote about how we got started analysing 8+ years worth of WSPR (pronounced "whisper") data. What is WSPR? WSPR, or Weak Signal Propagation Reporter, is signal reporting network set up by radio amateurs for monitoring the ability for radio signals to get from one place to another. Why would I care? I'm a geek and I like data — more specifically, the things it can tell us about seemingly complex processes. I'm also a radio amateur and enjoy the technical aspects of communicating around the globe with equipment that I've built myself.
I also find it mindblowing that you can be heard on the other side of the globe with WSPR using the same amount of power as a mobile phone, when I can't make a phone call on some sections of my commute. In this post, I'm going to pick up where Graham left off and dive further into what the WSPR dataset has to offer, comparing BigQuery and my initial MySQL-based attempt along the way.
When I initially conceived of this idea, I discounted BigQuery as being too big. It felt like using a sledgehammer to crack a walnut. So I fired up MySQL and began to load the data in so I could start querying. The WSPRnet website helpfully has the data going back to March 2008 downloadable for each month, with the current month being updated daily. I downloaded January's dataset on the 10th, by which time it contained approximately 5.5 million entries. This is where the waiting began. It had been several years since I last set up a MySQL installation; although it was relatively painless to get it installed under Ubuntu, I spent probably ten minutes creating users and the initial table structure, by which point, if you read Graham's post, he was already on the home stretch by now. Below is the structure I created in MySQL:
I then began to load the data. Given I was only using ten days of data, I expected this to be a largely painless affair. Some reading of the manual brought me to MySQL's data load command, which after another 20 minutes of manipulating had me loading the data directly from CSV into the table using the following:
load data concurrent local infile '/home/aaron/Downloads/wsprspots-2017-01.csv' replace into table wsprDB.wsprspot fields terminated by ',' (spotid,@spot,reporter,reporters_Locator,snr,frequency,callsign,tx_Locator, tx_power,drift,distance,azimuth,band,version,code) set spot_time=from_unixtime(@spot);
The whole load process took around ten minutes to complete for 5.5 million records. Multiplying this out for the entire dataset, which is 100x larger at 0.5 billion records, it would take around 1,000 minutes or 17 hours to load the data. Whilst there were no time constraints, I wanted to start querying immediately, so I put loading the rest of the data on hold.
What Can We Find Out?
I started querying the data with some simple queries; for instance, for a given WSPR station, what's the average distance at which it can be heard for each frequency? This is readily achievable, you've probably already thought of the SQL, but here it is:
SELECT Band,avg(distance) FROM wsprspots WHERE callsign='M1GEO' GROUP BY Band
And I waited and waited and it took two minutes to return. The astute will notice... I'd forgotten to create indexes on the columns I was querying on — doh! Rectifying this mistake took around 20 minutes of mostly waiting for MySQL to index before my query was returning in ~2 seconds. Better, but still not great given this is only a fraction of the data. This highlights one of the major advantages of BigQuery: you can simply load the data and start querying without having to consciously address such performance concerns.
Now we've warmed up, I've refreshed my memory of SQL, got the data loaded into both MySQL and BigQuery, and both datasets are queryable. It's time to do some real "work" and see what we can find out from the data, comparing BigQuery to my initial MySQL-based approach.
Working Out Where to Transmit
If we want our signal to be heard as far away as possible, we want to transmit on the right frequency/band. It's no use transmitting where no one is listening or at a frequency that doesn't travel very far. We can readily find the average distance for each band from the data with the following SQL:
SELECT Band,avg(Distance) FROM [dataproc-fun:wsprnet.all_wsprnet_data] group by Band order by Band
Running this on MySQL took 97 seconds on the reduced dataset compared to BigQuery, which returned the results in a mere 2.5 seconds for the complete dataset. Wow. BigQuery processed 100x the data in less than one-tenth of the time. The results were downloaded from the console as CSV and plotted using Excel.
Finding the Right Time
As mentioned in the introduction, WSPR can be used to communicate around the world using only milliwatts of power — although I should point out that this does happen at all times of day and on all bands. Let's see if we can find out when we're most likely to hear someone from the data. We could determine if there is a particular time of day when a large number of transmitters can be heard from a particular country. If this is true, it would suggest that we could also use simpler modes, such as voice or Morse code to communicate, rather than the limited messages which WSPR sends. Alternatively, we could also calculate the average distance at which transmitters from a particular country are received. Given I'm based in Australia and all Australian radio amateur calls signs begin with the prefix "VK", we can quickly count the number of other people having heard Australian signals by the time of day with the following query:
SELECT hour(Timestamp) as hour,Band,avg(Distance) FROM [dataproc-fun:wsprnet.all_wsprnet_data] where Call_Sign like 'VK%' group by hour,Band order by Band,hour
Running this on MySQL took around ten seconds to complete on the reduced dataset compared to 2.5 seconds on BigQuery for the full dataset. A win for BigQuery, especially given the MySQL dataset is 1/100 of the full set. And the result? See for yourself...
Here, I've just pulled out the results for a single country frequency, which shows the best time of Day is about 11 AM GMT, or 10 PM AEST. A bit late in the evening, but who needs sleep when you've got radio and BigQuery?
Countries by Time
But what if distance wasn't all we cared about? Many of those in amateur radio like to talk, and so for them, the hobby is all about talking to as many people from different countries and backgrounds as possible. Once more, we can find out from the data which time of day is the best to contact as many countries as possible. As mentioned previously, the first few characters of the call sign determine its country. We can, therefore, group these and count them when at each time of day each country can be heard, as I've done in the following query:
select count(f0_), hour from (SELECT substr(Reporter,0,2),hour(Timestamp) as hour,count(Spot_ID) FROM [dataproc-fun:wsprnet.all_wsprnet_data] WHERE Call_Sign like 'VK%' and band=10 group by f0_,hour, order by f0_,hour) group by hour
As for the times, both returned results after three seconds, which produce the following graph:
This suggests that for the 30m band, after about 6 AM GMT, it doesn't matter what time of day we can hear the same number of prefixes and potentially countries. It's interesting that that MySQL was actually quicker in this query, although this is likely due to the where clause greatly reducing the number of rows MySQL had to process.
We have taken a dive through 8+ years of propagation data to determine how and when we can get the most people to hear us. We've also shown that even for relatively modest problems of only a few GB, BigQuery can outperform MySQL running on modest hardware. Given the ease with which the data can be loaded and manipulated, BigQuery has come out on top in almost every query I performed. BigQuery also gives you 1 TB per month for free, which is more than enough for small side projects like this. By using BigQuery, you can spend more time on the interesting bits of the project rather than optimizing poorly performing databases and queries. I will certainly continue to use BigQuery for the rest of this project which will include a web application to make querying and graphing data a breeze for others, but I'll save that for another post.