Building a data pipeline whose performance and cost scale well with data volume might sound easy, but once you start pushing terabytes of data through the pipes, it gets complicated and can be costly. Our Senior Director of Software Engineering, Tomas Jansson, explains.
As a data company, we build data pipelines every day, but that does not mean we knew how to do it well from day one.
When we first began implementing our data pipeline three years ago, we had a couple of hundred interactions per day. Fast forward three years and we now have billions of interactions, meaning terabytes of data every single day.
Our pipeline can be divided into three main components:
This post will focus on the processing pipeline since the other two components are not pipelines in the same way, and they have their own set of challenges.
Today we process data in daily batches, mainly for simplicity but also because we do not have any real-time requirements at the moment.
The data we get at Unacast is location data from devices. This data is processed by device, which turns out to be a challenge when you do it on 3 billion interactions in one go. With this background, I will show you how we have done it in the past, how we are doing it in our current pipeline, and how we will do it in the future pipeline, followed by some learnings.
When we created the first processing pipeline, it was all about getting something out to the market as quickly as possible to verify the product. This first pipeline ran 100% on BigQuery, using a tool we built in Go to orchestrate the pipeline. At that point we did not have a lot of data, so every day we processed all the data to get as accurate a result as possible. Since we processed all the data every single day, we knew that it would not scale in terms of cost and performance. This was a conscious choice based on how fast we wanted to move and the fact that we did not process that much data at the time.
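To illustrate why reprocessing everything every day does not scale, here is a minimal sketch in Python. It assumes, purely for illustration, that every day adds one equally sized batch of data; the function name and that assumption are hypothetical and not taken from our pipeline.

```python
def cumulative_batches_scanned(num_days: int, reprocess_all: bool) -> int:
    """Total number of day-sized batches read after num_days daily runs.

    Assumption (hypothetical): each day adds one equally sized batch.
    """
    total = 0
    for day in range(1, num_days + 1):
        # Full reprocessing reads every batch received so far;
        # incremental processing reads only the batch for that day.
        total += day if reprocess_all else 1
    return total


if __name__ == "__main__":
    for days in (30, 365):
        full = cumulative_batches_scanned(days, reprocess_all=True)
        incremental = cumulative_batches_scanned(days, reprocess_all=False)
        print(f"{days} days: full={full}, incremental={incremental}")
```

Under that assumption the work for full reprocessing grows quadratically: after a year it has read 66,795 day-sized batches, compared with 365 when only the new day is processed.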
The current pipeline is about one year old. There have of course been minor modifications to it during this period, but the way we process the data is basically the same as it was one year ago.
When we started implementing it we switched from BigQuery to Dataflow, since Dataflow seemed to be exactly the tool we needed. To orchestrate the pipeline we have also implemented some supporting App Engine apps and Cloud Functions.
The Dataflow pipeline has been running nicely and served us well, especially after the shuffle support was added.
The volume has now reached a point where it is quite obvious that grouping really large amounts of data, the way we do now, is not one of Dataflow's strengths. The kind of grouping we need is both costly and slow to do in Dataflow.
Another pain point is the orchestration part. The orchestration works fine and we rarely have serious production errors, but it is hard to understand how it works since it is scattered over multiple projects and applications.
The future pipeline will again move major parts of the pipeline back from Dataflow to BigQuery for two main reasons:
What we do lose by going back to BigQuery is some testability. Testing will require more work, since we now have to execute the BigQuery queries and check the results instead of just running unit tests in Java.
Making these changes to the pipeline will, in our case, cut costs by 75% to 90% for the piece of the pipeline that is replaced.
We will still keep some complex processing in Dataflow, but that will be done on already grouped data, meaning that Dataflow will also scale linearly in this scenario. We can probably move that logic to BigQuery as well, but since we want it to be easy to test we will keep it in Dataflow for now.
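The split between the grouping step and the per-device step can be sketched as follows. This is plain Python rather than Dataflow or BigQuery code, and the record shape and the `process_device` logic are hypothetical stand-ins. The point is only that once the data is grouped, each device can be handled independently of every other device.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (device_id, unix_timestamp).
Interaction = Tuple[str, int]


def group_by_device(interactions: Iterable[Interaction]) -> Dict[str, List[int]]:
    """The expensive step: bring all interactions for a device together."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for device_id, timestamp in interactions:
        grouped[device_id].append(timestamp)
    return dict(grouped)


def process_device(timestamps: List[int]) -> Dict[str, int]:
    """Hypothetical per-device logic; it only ever sees one device's data."""
    ordered = sorted(timestamps)
    return {
        "count": len(ordered),
        "first_seen": ordered[0],
        "last_seen": ordered[-1],
    }


def process_grouped(grouped: Dict[str, List[int]]) -> Dict[str, Dict[str, int]]:
    """Work here is proportional to the input size, one device at a time."""
    return {device_id: process_device(ts) for device_id, ts in grouped.items()}


if __name__ == "__main__":
    grouped = group_by_device([("a", 3), ("b", 5), ("a", 1)])
    print(process_grouped(grouped))
```

In terms of the future pipeline, `group_by_device` corresponds to the part we move to BigQuery, and `process_grouped` to the part that stays in Dataflow because it is easy to unit test.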
We will also deal with our second pain point, the orchestration. Not so long ago Composer, a hosted version of Apache Airflow, was released on GCP. Moving all our orchestration to Composer will give us a visual representation of how our pipeline works, and we will also be able to deprecate a lot of existing orchestration code.
So what have we learned so far about building data pipelines on GCP?
This first learning applies to data pipelines on any platform. When we first receive data we add a lot of metadata to it, and that metadata is then kept through the whole pipeline. Keeping it is definitely not a problem when you have gigabytes of data each day, but when you scale to terabytes it starts to hurt. You have to pay for that data everywhere you store it, and all processing will be slower or more expensive, since the memory footprint of the data you process is much higher.
In the future pipeline, we will also drop fields that are needed only in the output before the processing step, and then join them back in before delivery, to keep the memory footprint through the pipeline as small as possible. If you group a lot of data, wide rows can be a deal breaker, since BigQuery limits how large a row can be.
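The drop-and-rejoin idea can be sketched in a few lines of Python. The field names and the assumption that every record carries a unique `record_id` are hypothetical; in our case the same thing would be done with BigQuery tables and a join rather than in-memory dictionaries.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]

# Hypothetical: the only fields the processing step actually reads.
PROCESSING_FIELDS = ("record_id", "device_id", "timestamp")


def split_records(records: Iterable[Record]) -> Tuple[List[Record], Dict[Any, Record]]:
    """Split each record into a slim part and a side part keyed by record_id."""
    slim: List[Record] = []
    side: Dict[Any, Record] = {}
    for record in records:
        slim.append({field: record[field] for field in PROCESSING_FIELDS})
        side[record["record_id"]] = {
            key: value for key, value in record.items() if key not in PROCESSING_FIELDS
        }
    return slim, side


def rejoin(processed: Iterable[Record], side: Dict[Any, Record]) -> List[Record]:
    """Join the output-only fields back in just before delivery."""
    return [{**row, **side[row["record_id"]]} for row in processed]


if __name__ == "__main__":
    raw = [{"record_id": 1, "device_id": "a", "timestamp": 10, "source": "sdk"}]
    slim, side = split_records(raw)
    # The heavy processing would run on `slim` only.
    print(rejoin(slim, side))
```

Only the slim records travel through the expensive grouping and processing steps; the side data is touched once at the start and once at the end.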
Our main components for building data pipelines are BigQuery and Dataflow. They were fairly new to us when we started, and we had to learn as we went. Knowing your tools and how they scale as the volume grows is crucial, both in terms of cost and performance.
If you can find something that enables your solution to scale linearly with the volume then you are in a good position, and for us this thing is BigQuery. BigQuery charges you based on the data volume you process and not how much CPU you use. This does not mean you can go haywire, since there are limitations on how much CPU you can use, but as long as you are under that limit everything is good.
As mentioned above we have been using Dataflow a lot, and it is an awesome product as long as you use it where it makes sense, and it does not make sense to use Dataflow for large scale grouping or lookups. If you are doing large scale grouping or lookups you should consider moving all the data to BigQuery and trying to do those operations there.
The last tool, which we have just introduced, is Cloud Composer, https://cloud.google.com/composer/, which is built on top of Apache Airflow, https://airflow.apache.org/index.html. In hindsight, we should probably have moved to Airflow earlier and hosted it ourselves, but with the newly added hosted version from Google Cloud the decision was a no-brainer. Using Airflow to orchestrate your pipeline will give you a visible representation of the pipeline and pipeline runs, easy retries, backfill and more.
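The core idea behind this kind of orchestration, describing the pipeline as a graph of dependent tasks and running them in order with retries, can be sketched with the Python standard library alone. This is not Airflow code and does not use the Airflow API; the task names and the `run_pipeline` helper are hypothetical, and it assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
    max_retries: int = 2,
) -> List[str]:
    """Run tasks in dependency order, retrying a failing task up to max_retries times.

    depends_on maps each task name to the set of tasks that must finish first.
    """
    completed: List[str] = []
    for name in TopologicalSorter(depends_on).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                # Give up only after the last allowed attempt has failed.
                if attempt == max_retries:
                    raise
        completed.append(name)
    return completed


if __name__ == "__main__":
    steps = {
        "load": lambda: print("loading"),
        "process": lambda: print("processing"),
        "deliver": lambda: print("delivering"),
    }
    order = {"load": set(), "process": {"load"}, "deliver": {"process"}}
    print(run_pipeline(steps, order))
```

A real orchestrator adds scheduling, backfill, and a UI on top of this, but having the dependencies declared in one place is what makes the pipeline easier to understand than logic scattered across several applications.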
To make sure it all made sense, and that we didn’t waste our time, we started off by implementing a proof of concept in which a lot had to be run manually. The proof of concept pipeline was a simplified version of the pipeline, but it still processed the relevant data volumes in each step. When we did the math on the results from this pipeline and compared them against the current pipeline, the "happy path" came out around 90% cheaper and about twice as fast.
The proof of concept took us a couple of days to implement and compare against the current pipeline. Based on the results of this work, and because we are now more sure about the scaling characteristics, we decided to start implementing the next version for real.