Using only one simple python script you too can run up an enormous bill in Google’s BigQuery and gain the animosity of your employer! Read on to find out more (SPOILER: I didn’t get fired).
Those nice people at Google recently released a feature called “date-partitioned tables” for BigQuery (which is their platform for storing and querying really stupidly large amounts of data really stupidly quickly). This allows you to slice up a big table of events into chunks of one day at a time, meaning that you can restrict your queries to only relevant days – saving you money – and any data in a partition that doesn’t get updated for more than 90 days will be moved onto long-term storage – saving you money. Apart from that, they look like normal BigQuery tables. Everybody likes saving money, right?
These kind of tables are perfect for continuous streams of events, and we happen to have several streams of event data that we currently process. This data is stored in BigQuery for analysis. One table, in particular, had about 18 months worth of events in it and was roughly 700Gb in size. Remember those facts, they will become relevant later. The majority of our queries against this table only use the last few days’ data, or the last month’s. The rest is queried infrequently, but not so rarely to make you want to remove it altogether.
We wanted to convert this large event table into a partitioned one. Google’s documentation mentions converting a set of per-day tables into a single date-partitioned table, but nothing about taking a single table of timestamped data and converting it. There’s no way to specify that you want to use a particular timestamp in the data as your partition time, unfortunately – it’s always the insert time. Just doing a simple copy into the new table would mean that all the data would be in the same partition because it was all inserted at the same time. While that would be fine if all we wanted was the benefit of long-term storage pricing, that would make using the cost-efficient querying feature almost impossible.
There is one way around this, though. You can target a particular partition for inserts by using table decorators. Using a table name like “events$20160810” you can insert data directly into that partition of your table. Great – I just need query for each day of data in the original table, and write that day’s events into the corresponding partition in the new table. Easy! Come, Boy-Wonder, to the python-script-mobile!
It was quite straightforward to write a python script that gets a list of all the days in the events table, and then loops around querying the table for each day writing that slice of the events into the appropriate partition of the new table. It worked really well, migrating the events into the date-partitioned table in about 45 minutes. I even tried to be clever and made it multi-threaded, making parallel BigQuery API calls. I was very pleased with myself.
Now, the more astute of you in the audience – those of you in particular that are familiar with the way Google charges for BigQuery usage – will have already realized something that I didn’t work out until later.
A few days later, in fact, when my boss became concerned that our BigQuery monthly bill was suddenly $4000 larger than normal. Our usual monthly charges are much, much smaller than that – BigQuery is really a very cheap service for analyzing terabytes of data. His first reaction, bless his faith in my competence, was that there was some sort of billing error on Google’s side. A few emails back-and-forth later, and Google assured him that we had actually run up that bill.
Sensibly, we had enabled the streaming of BigQuery audit logs into BigQuery itself. This allowed us to analyze the queries made over the last few days and work out what had happened. The tables give us useful data like the timestamp of the query, the number of bytes processed, and the SQL executed. A few quick GROUP-BYs and ORDER-BYs and we had a good idea that we’d run a large number of queries over the same table. Over the course of two days, about 1200 queries were run against the events table, each query processing about 700Gb of data.
Remember the size of the events table? 700Gb going back about 18 months. To query each day of that table would take about 500 queries. I did two full runs of my partitioning script – one to see if it worked, another when I added the ever-so-clever multithreading. Add in an extra couple of hundred queries during testing of the script and it was clear that my partitioning script was the culprit. I was only querying one day’s worth of data each time, so why did it cost so much?
Under the hood, BigQuery is a column-based database. That means that the amount of data involved in a query is largely determined by how many columns are included – regardless of how that query is constrained. This is all explained pretty clearly on the BigQuery pricing page – you know, for people that read that sort of thing before embarking on ambitious partitioning adventures. I’d had this explained to me when I first started using BigQuery, and almost immediately forgot it. The nice SQL interface hides the fact that you’re working with something quite different from a mundane database. When I wrote my queries, I thought we would just get charged for querying that day’s worth of data – not for all the data in the table. Google charges a very reasonable $5 per terabyte queried. Each of my 1200 queries were returning all the columns in the original table, so even though I was then restricting the results to a single day I was still using the full 700Gb of data. 1200 x 0.7Tb x $5 = $4200.
Consequently, I was not the most popular member of the team. Which is quite shocking once you realize how handsome and charming I am.
Luckily, Shine Solutions is a generous and forgiving employer and elected to merely administer a comprehensive thrashing rather than fire me. A beating which I soundly deserved and for which I am thoroughly grateful. It’s the only way people like me learn.
Following on from this we have enabled quotas for our account, just in case I haven’t learned my lesson. Google provides tools for managing the amount of data an individual user can query. These settings aren’t enabled by default, and can’t be managed directly by users (you have to send a request that is dealt with by a real person at Google).
There is some good news though – after explaining what we were trying to do, one of the Google technical wizards acknowledged that it was a reasonable use case and that they have plans to support migrating large tables to date-partitioned versions. They hope to have something available by Q4 of this year.