A carefully managed data pipeline can provide you with seamless access to reliable and well-structured datasets.
A data pipeline is a generalised way of transferring data from a source system A to a target system B. Pipelines are developed in small pieces and integrated with data, logic and algorithms to perform complex transformations. To do this effectively, some essential practices need to be followed.
In our data pipeline playbook we have identified eleven practices to follow when creating a data pipeline. Here we touch on six of them, including how to start with a steel thread. In our next blog post we will talk about iteratively creating your data models and about observing the pipeline. Applying these practices will allow you to integrate new data sources faster and at higher quality, as outlined in our recent post on the benefits of a data pipeline.
About this series
This is part four in our six-part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. In part three we considered the ‘must have’ key principles of data pipeline projects. Now we look at the six key practices needed for a data pipeline. Before we get into the details we just want to cover off what’s coming in the rest of the series. In part five we look at more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project.
The growing need for good data engineering
Today, data engineers serve a wider audience than they did just a few years ago. As more organisations need to apply machine learning techniques to their data, data engineers face new challenges in order to remain relevant. Essential to every project is the ability to reliably deliver large-volume data sets so that data scientists can train more accurate models.
Aside from dealing with larger data volumes, these pipelines need to be flexible in order to accommodate the variety of data and the increasingly high processing velocity required. The following practices are those that we consider the minimum requirement for a successful project. They are based on our collective knowledge and experience gained across many data pipeline engagements.
Practice 1: Build for the right latency
When designing the pipeline, it’s important to consider what level of latency you need. What is your speed of decision? How quickly do you need the data? Building and running a low latency, real-time data pipeline will be significantly more expensive, so make sure that you know you need one before embarking on that path. You should also ask how fast your pipeline can be. Is it even possible for you to have a real-time data pipeline? If all your data sources are produced by daily batch jobs, then the best latency you can reach will be daily updates, and the extra cost of real-time implementations will not provide any business benefits.
If you do need real-time or near real-time data, then latency needs to be a key factor at each step of the pipeline. The speed of the pipeline is limited by the speed of its slowest stage.
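To make the point about the slowest stage concrete, here is a minimal Python sketch. The stage names and intervals are hypothetical and chosen only for illustration; the idea is simply that the stage with the longest refresh interval sets a floor on how fresh the data can be.

```python
from datetime import timedelta

# Hypothetical refresh intervals for each stage of an imagined pipeline.
EXAMPLE_STAGES = {
    "clickstream_ingest": timedelta(seconds=5),
    "crm_daily_export": timedelta(days=1),
    "transform_and_load": timedelta(minutes=10),
}


def slowest_stage(stage_intervals: dict[str, timedelta]) -> tuple[str, timedelta]:
    """Return the name and interval of the stage that updates least often.

    Data downstream cannot be fresher than this interval, however fast
    the other stages are.
    """
    name = max(stage_intervals, key=stage_intervals.get)
    return name, stage_intervals[name]
```

With these example values the daily CRM export is the limiting stage, so investing in a streaming implementation for the other stages would not make the combined data any fresher than daily.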
And be careful not to confuse the need for a real-time decision engine with the need for a real-time historical data store, such as a data warehouse for the data scientists. Decision models are created from stores of historical data and need to be validated before deployment into production. Model release usually takes place at a slower cadence (e.g., weekly or monthly). Of course, the deployed model will need to work on a live data stream, but we consider this part of the application development. This is not the appropriate use for a data warehouse or similar.
Practice 2: Keep raw data
Ingestion should start by storing raw data in the pipeline without making any changes. In most environments, data storage is cheap, and it is common to have all the ingested data persisted and unchanged. Typically, this is done via cloud file storage (Amazon S3, Google Cloud Storage, Azure Storage), or HDFS for on-premises data.
Keeping this data allows you to reprocess it without re-ingestion if any business rule changes, and it also retains the possibility of new pipelines based on this data if, for example, a new dashboard is needed.
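As a rough illustration, the sketch below persists an ingested payload byte-for-byte under a source and date partition. It uses the local filesystem as a stand-in for cloud file storage, and the function name, directory layout and file naming are our own assumptions rather than a prescribed convention.

```python
import hashlib
from datetime import date
from pathlib import Path


def store_raw(payload: bytes, source: str, ingest_date: date, root: Path) -> Path:
    """Write the payload unchanged under raw/<source>/ingest_date=<date>/.

    The file is named after the SHA-256 of its content, so ingesting the
    same bytes twice does not create a duplicate or alter the stored copy.
    """
    digest = hashlib.sha256(payload).hexdigest()
    target_dir = root / "raw" / source / f"ingest_date={ingest_date.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{digest}.bin"
    if not target.exists():  # identical content is already stored
        target.write_bytes(payload)
    return target
```

Because nothing is parsed or cleaned at this point, later transformations can always be rerun from these files when a business rule changes.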
Practice 3: Break transformations into small tasks
Pipelines are usually composed of several transformations of the data, activities such as format validation, conformance against master data, enrichment, imputation of missing values, etc. Data pipelines are no different from other software and should thus follow modern software development practices of breaking down software units into small reproducible tasks. Each task should target a single output and be deterministic and idempotent. If we run a transformation on the same data multiple times, the results should always be the same.
By creating easily tested tasks, we increase both the quality of the pipeline and our confidence in it, and we make the pipeline easier to maintain. If we need to add or change something in a transformation, we have the guarantee that when we rerun it, the only changes will be the ones we made.
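A small task of this kind might look like the following sketch. The field name `amount` and the default value are hypothetical; what matters is that the function has a single output, does not mutate its input, and does not depend on the clock, random numbers or external state.

```python
def impute_missing_amounts(rows: list[dict], default: float = 0.0) -> list[dict]:
    """Return new rows in which a missing or None 'amount' is set to default.

    Deterministic: the same input always yields the same output.
    Idempotent: applying the function to its own output changes nothing.
    """
    return [
        {**row, "amount": default if row.get("amount") is None else row["amount"]}
        for row in rows
    ]
```

A test for such a task can simply run it twice and compare the results, which is what makes tasks like this cheap to verify.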
Practice 4: Support backfilling
When pipelines are still immature at the start of development, it may not be possible to fully evaluate whether the pipeline is working correctly or not. Is this metric unusual because this is what always happens on Mondays, or is it a fault in the pipeline? We may well find at a later date that some of the ingested data was incorrect. Imagine you find out that during one month a source was reporting incorrect results, but for the rest of the time the data was correct.
We should engineer our pipelines so that we can correct them as our understanding of the dataflows matures. We should be able to backfill the stored data when we have identified a problem in the source or at some point in the pipeline, and ideally, it should be possible to backfill just for the corresponding period of time, leaving the data for other periods untouched.
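One common way to support this is to partition stored data by day and rebuild whole partitions from the raw data. The sketch below assumes that design and uses plain dictionaries keyed by date as stand-ins for the raw and curated stores; all names are hypothetical.

```python
from datetime import date, timedelta
from typing import Callable, Iterator

# Stand-in for a date-partitioned store: one list of records per day.
Partitions = dict[date, list[dict]]


def days_between(start: date, end: date) -> Iterator[date]:
    """Yield every day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(
    raw: Partitions,
    curated: Partitions,
    transform: Callable[[list[dict]], list[dict]],
    start: date,
    end: date,
) -> list[date]:
    """Rebuild curated partitions for [start, end] from the raw data.

    Each affected day is overwritten as a whole; days outside the range
    are never touched. Returns the list of days that were rebuilt.
    """
    rebuilt = []
    for day in days_between(start, end):
        if day in raw:
            curated[day] = transform(raw[day])
            rebuilt.append(day)
    return rebuilt
```

Overwriting a full partition, rather than appending to it, means the backfill can be rerun safely if the corrected transformation itself needs another fix.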
Practice 5: Start with a steel thread
When starting at a greenfield site, we typically build up data pipelines iteratively around a steel thread: first a thin data pipe that is a narrow slice through the whole architecture. This approach progressively validates the quality and security of the data. The first thread creates an initial point of value, probably a single data source, with some limited processing, stored where it can be accessed by at least one data user. The purpose of this first thread is to provide an initial path to data and uncover unexpected blockers, so it is selected for simplicity rather than for having the highest end-user value. Bear in mind that in the first iteration, you will need to:
- Create a cloud environment which meets the organisation’s information security needs.
- Set up the continuous development environment.
- Create an appropriate test framework.
- Model the data and create the first schemas in a structured data store.
- Coach end users on how to access the data.
- Implement simple monitoring of the pipeline.
Later iterations will bring in more data sources and provide access to wider groups of users, as well as bringing in more complex functionality such as:
- Including sources of reference or master data.
- Advanced monitoring and alerting.
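To give a feel for how thin a first steel thread can be, here is a sketch of one: a single CSV source, minimal processing, one table in a structured store, and a simple row-count check as monitoring. SQLite stands in for the data store, and the `orders` table and its columns are hypothetical.

```python
import csv
import io
import sqlite3


def run_steel_thread(csv_text: str, conn: sqlite3.Connection) -> int:
    """Load one CSV source into one table and return the number of rows read."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Limited processing only: cast the two columns to their stored types.
    rows = [(int(r["id"]), float(r["amount"])) for r in reader]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)"
    )
    # INSERT OR REPLACE keeps a rerun from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO orders (id, amount) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def check_row_count(conn: sqlite3.Connection, expected: int) -> bool:
    """Simple monitoring: does the table hold the number of rows we expect?"""
    (actual,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
    return actual == expected
```

Nothing here is sophisticated, which is the point: the value of the first thread comes from exercising the environment, tests, schema and access path end to end, not from the processing itself.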
Practice 6: Utilise cloud – define your pipelines with infrastructure-as-code
Pipelines are a mixture of infrastructure (e.g., hosting services, databases, etc.), processing code, and scripting/configuration. They can be implemented using proprietary and/or open-source technologies. However, all of the cloud providers have excellent cloud native services for defining, operating and monitoring data pipelines. They are usually superior in terms of their ability to scale with increasing volumes, simpler to configure and operate, and support a more agile approach to data architecture.
Whichever solution is adopted, since pipelines are a mixture of components, it is critical to adopt an infrastructure-as-code approach. Only by having the pipeline defined and built using tools such as Terraform, and source controlled in a repository, will pipeline owners have control over the pipeline and the confidence to rebuild and refine it as needed.
Hopefully this gives a clearer overview of some of the essential practices needed to create an effective data pipeline. In the next blog post in this series, we will outline more of the practices needed for data pipelines. Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.