A Three Step Approach for Building Data Science Pipelines

Data scientists at Unacast spend much of their time designing pipelines for extracting insights from data. A common pain point is transitioning these pipelines from prototypes to the production environment. We have been experimenting with a three step approach for developing pipelines and moving them into production: prototyping in Python using Colaboratory notebooks, intermediate testing on larger data samples using the Apache Beam Python SDK, and porting to Java for production systems. Here we cover these three steps, as well as some of the lessons we have learned along the way.

Prototyping in Colaboratory Notebooks

Colaboratory notebooks are the primary tool for one-off analyses and early-stage development among data scientists at Unacast. Colaboratory is built on top of Project Jupyter and allows interactive programming in Python. Python is the language of choice here due to its extensive ecosystem of data wrangling tools and machine learning libraries. Colaboratory documents can be edited simultaneously just like any other Google Doc, which is great for collaboration and quick iterations. The notebooks are stored on Google Drive and executed in the cloud, so there is no need to manage local environments.
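
As a minimal sketch of what this prototyping stage can look like, the snippet below loads a small data sample with pandas inside a notebook; the file name and column names are hypothetical.

```python
# Hypothetical notebook prototype: load a sample of location events
# and look at daily activity. File and column names are illustrative.
import pandas as pd

events = pd.read_csv("sample_events.csv", parse_dates=["timestamp"])

daily_devices = (
    events
    .assign(day=events["timestamp"].dt.date)  # bucket events by calendar day
    .groupby("day")["device_id"]
    .nunique()                                 # unique devices seen per day
)

daily_devices.plot(kind="bar", title="Unique devices per day")
```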

Distributed Processing with Apache Beam

Colaboratory notebooks are great for prototyping, but quickly run into resource issues with larger datasets. Timeouts and out-of-memory errors occur even on tiny fractions of the data we process daily. We therefore use Apache Beam for distributed processing of large datasets, with the processing itself performed by Dataflow runners on the Google Cloud Platform. The Apache Beam SDK is available for Python, Java and Go. By choosing the Python SDK we are able to continue using modules and code from the prototypes without significant rewrites. The Python SDK is mainly used for testing algorithms on larger data samples in production-like environments. It serves as a check that the model behaves as expected before we invest in building a more robust version.
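
As a rough sketch, a Beam pipeline written with the Python SDK and executed on Dataflow might look like the following; the project id, bucket paths, and the per-device counting logic are hypothetical placeholders.

```python
# Minimal Apache Beam (Python SDK) pipeline sketch; project, bucket and
# the per-device counting logic are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",              # use "DirectRunner" for local tests
    project="my-gcp-project",             # hypothetical GCP project id
    region="europe-west1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/events/*.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeyByDevice" >> beam.Map(lambda fields: (fields[0], 1))
        | "CountPerDevice" >> beam.CombinePerKey(sum)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/device_counts")
    )
```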

Production Code in Java

Pipelines built with the Beam Python SDK are a good choice for small to medium workloads, but they run into performance and maintainability issues as project size and data volumes increase. Critical data pipelines are therefore ported to the Java SDK of Apache Beam before being rolled out into production.

Additional Key Takeaways

The three step process for developing data science pipelines and moving them into production brings several additional benefits:

  • An interactive programming environment is crucial for rapid prototyping of data science pipelines. Colaboratory has all the benefits of Jupyter notebooks, in addition to great options for sharing and collaboration.
  • The Python SDK for Beam provides a rapid way of testing model performance on larger datasets. Models can be refined or scrapped before committing the time to build a production pipeline.
  • Some use cases do not require rewriting pipelines in Java. The main benefits of porting pipelines from Python to the Java SDK are performance and a more mature SDK. Sticking to the Python SDK is advisable where performance is acceptable.
