Hello! I’m Lars, a Data Engineer at Unacast. In this post I’ll be diving into my experiences working with Google BigQuery over the past year. I’ll point out some of the ways working with a modern tool differs from working with the more traditional “Business Intelligence” tools I used to deal with as a consultant, highlight a few advantages, and warn about certain pitfalls that we’ve come to discover. And as a former consultant, I will of course give the conclusion away right at the beginning: working with BigQuery is great! But if you don’t pay attention to what you’re doing, it will make you regret it.
Unacast is, as you may already know, a data company building a platform for location data to better understand the real world. We collect data from smartphones in the US with the help of our partners, process and enrich the data, and sell the resulting data products to clients in the advertising and research industries. As a Data Engineer, my job is to make sure that the collection, processing, storage and reporting of the data go smoothly. And as a startup with an eye to the future we are of course doing it all in the cloud, using Google Cloud Platform and Google BigQuery as our primary database and query engine.
BigQuery is a Software-as-a-Service query engine specifically designed to handle large volumes of data. It is serverless, fully managed and easy to use: all you need to do is load your structured data into it, and Google will handle the rest. You query the data using SQL, and the system provides results within very reasonable time frames, scaling automatically with the size of your data and the complexity of your queries. It does this by cleverly distributing the data and the processing over the many machines in Google’s huge data centers. Unlike in traditional relational databases, the data isn’t indexed. Instead, BigQuery relies on very fast full scanning of the data referenced in queries. And this brings us to where things start getting complicated: the pricing model. In addition to a small fee for data storage, you pay per query for the amount of data scanned.
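Since you pay for what you scan, it is worth estimating the cost of a query before running it. BigQuery has a dry-run mode that reports how many bytes a query would process without executing it; the sketch below turns such a byte count into dollars. The 5 USD per TB rate is an assumption on my part (it matches the numbers in the stress test at the end of this post), so check the current pricing page before trusting it:

```python
# Minimal sketch: convert a query's scan size into an estimated cost.
# The rate below is an assumed on-demand price; verify against Google's
# current BigQuery pricing before relying on it.

USD_PER_TB = 5.0  # assumed on-demand rate, USD per terabyte scanned


def estimate_cost(bytes_scanned: int) -> float:
    """Return the estimated on-demand cost in USD for a given scan size."""
    return bytes_scanned / 10 ** 12 * USD_PER_TB


# A query that scans one full terabyte:
print(estimate_cost(10 ** 12))  # 5.0
```

In practice you would feed this the `total_bytes_processed` figure that a dry run of the query reports, rather than guessing the scan size yourself.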
A recurring problem I faced as a consultant making interactive reports for (impatient) business users was queries that took too long to complete. To improve execution times and avoid losing the interest of our users there were a couple of things we could try, the first of which was to rewrite queries and/or optimize the database for the most frequent ones. If that didn’t work, we would try materializing aggregates for use in the slowest queries. The last resort was to upgrade the database itself, if the budget allowed it. Naturally we would try everything else before even bringing that idea up to the client. I remember coming across BigQuery in one of those moments, but I was skeptical of its promises of performance and ease of use. Another reason we never tried it then was how hard it was to estimate what it would cost to put to use, not knowing how much querying the users of the reporting solution would do in practice.
Now that I’ve been using BigQuery for a year I can vouch for most of the claims about how well it works. Using it is very simple, and it has surprised us on several occasions by successfully handling nasty queries we didn’t think we’d get away with. It’s good for ad-hoc data exploration, for scheduled bulk transforms that can be formulated in SQL, and as the engine behind interactive reports. We can easily change our business logic because, for the most part, we query the raw data directly. We also store our data denormalized and don’t have to get caught up in data modeling. Beyond not being able to sort very large data sets, we simply aren’t hitting any walls with BigQuery.
The main advantage we get from using BigQuery is the increased speed at which we can prototype. We can get the results we want quickly, perhaps by taking shortcuts here and there, since BigQuery scales under the hood to whatever we throw at it. And it is often tempting to deploy things to production as soon as the results are there. It then becomes very convenient that BigQuery scales so well, as we can feel confident that what works today will keep working tomorrow with the ever-increasing volumes of data. But that’s the thing: sometimes it might have been better if it didn’t. With so little friction on a day-to-day basis it is easy to forget all about the pricing model, which is one of those pitfalls I mentioned earlier.
My father is an editor and offered me the following analogy when I (as the good son I am) tested my material for this post on him: the transition from conventional relational databases to BigQuery is like the transition from analog to digital film making.
Film is expensive. Back when they relied on it, directors always had to plan their shots carefully so as not to waste film and blow their budgets. When the editors took over in the editing room, they benefited from that planning and could piece the shots together accordingly. Then digital video came along and things changed. Directors no longer had to worry as much about the cost of filming and started experimenting more on set, producing much less thought-out material in much larger quantities. The editors had to spend far more time and energy piecing the film together than before. With BigQuery something similar is happening. We, the directors in the analogy, don’t have to be as conservative as we used to and end up sending more work to the query engine, our editing room. That’s nice for us, as we feel liberated from the shackles of the past, but it’s also very nice for our liberator Google, who gladly charges us for each of the many more complex queries we now run.
To avoid falling into a pattern of overspending we have to be conscious about what we are doing. We can query directly on the raw data, but should we? Our report works, but should we deploy it to production and become dependent on it? Flexibility is handy, but the value of planning and finding efficient solutions isn’t going away. We shouldn’t rely on the query engine to do all the dirty work for us just because we can.
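One practical guardrail for that consciousness is to refuse queries whose estimated scan exceeds a budget. BigQuery itself supports this idea through a maximum-bytes-billed setting on query jobs; the sketch below is a hypothetical client-side version of the same check, with a made-up cap and helper name:

```python
# Hypothetical guardrail: reject queries whose dry-run scan estimate
# exceeds a per-query cap. The 500 GB limit is an example, not a
# recommendation; tune it to your own budget.


class QueryBudgetError(Exception):
    """Raised when a query would scan more data than allowed."""


MAX_BYTES = 500 * 10 ** 9  # example per-query cap: 500 GB


def check_budget(estimated_bytes: int, cap: int = MAX_BYTES) -> None:
    """Raise QueryBudgetError if the estimated scan exceeds the cap."""
    if estimated_bytes > cap:
        raise QueryBudgetError(
            f"query would scan {estimated_bytes / 10 ** 9:.0f} GB, "
            f"cap is {cap / 10 ** 9:.0f} GB"
        )


check_budget(120 * 10 ** 9)      # within budget, passes silently
# check_budget(3200 * 10 ** 9)   # would raise QueryBudgetError
```

A check like this in the deployment path forces the "should we?" conversation at exactly the moment a prototype is about to become a production dependency.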
After working with BigQuery for a year I can conclude that the game has changed. One bottleneck has been eliminated and we get results quicker, but the underlying challenges of data management are still there. It’s easy to let best practices slip when you don’t immediately feel the consequences: everything still works, but the monthly bill suddenly grows out of control. I don’t miss working with traditional databases, but the things I learned back then are as important as ever. And at the rate data volumes are growing today, we probably couldn’t do without tools like BigQuery!
Finally, a quick example: I wanted to run a stress test to underpin the claim that BigQuery “always works”, so I ran a query that counts the number of distinct IDs in all of our raw data from the past 30 days. The query scanned one column of 94 billion rows, 3.2 TB of data, and cost 16 USD. A pretty nasty query by our standards, and it took 13 minutes to complete. I repeated the query a couple of times (you know, just in case). If I were to run this experiment once an hour, every day at work, the costs would exceed my salary! Clearly, with the great freedom of BigQuery comes its share of responsibility.
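The back-of-the-envelope math behind that claim, using the numbers from the stress test:

```python
# Extrapolating the stress test: 16 USD per run, once an hour, every day.
cost_per_run = 16       # USD, the price of the 3.2 TB scan above
runs_per_day = 24
days_per_month = 30

monthly = cost_per_run * runs_per_day * days_per_month
print(monthly)  # 11520 -- over 11,000 USD a month for one repeated query
```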