Common pitfalls of data pipeline projects, and how to avoid them

Knowing, understanding and managing your data throughout its lifecycle is more important than it has ever been. And more difficult. 

Of course, the never ending growth in data volume is partly responsible for this, as are also countless processes that need to be applied to the data to ensure it is usable and effective. Which is why data analysts and data engineers turn to data pipelining.

Added complexity is involved when, In order to keep abreast of the latest requirements, organisations need to constantly deploy new data technologies alongside legacy infrastructure. 

All three of these elements mean that, inevitably, data pipelines are becoming more complicated as they grow. In the final article in our data pipeline series, we have highlighted some of the common pitfalls that we have learned from our experience over the years and how to avoid them. These are also part of our Data Pipeline Playbook.

About this series

This is the final post in our six part series on the data pipeline, taken from our latest playbook. Now we look at the many pitfalls you can encounter in a data pipeline project. In the series before now, we looked at what a data pipeline is and who it is used by. Next we looked at the six main benefits of a good data pipeline, part three considered the ‘must have’ key principles of data pipeline projects, and part four and five covered the essential practices of a data pipeline. So here’s our list of some of the pitfalls we’ve experienced when building data pipelines in partnership with various clients. We’d encourage you to avoid the scenarios listed below.

Avoid tightly coupling your analytics pipelines with other business processes

Analytics data pipelines provide data to produce insights about your customers, business operations, technology performance, and more. For example, the role of a data warehouse is to create an historical record of data that can be mined for insights.

It is tempting to see these rich data sources as the best source of data for all data processing and plumb key business activities in these repositories. However, this can easily end up preventing the extraction of insights it was implemented for. Data warehouses can become so integrated into business operations – effectively acting as the Operational Data Store (ODS) – that they can no longer function as a data warehouse. Key business activities end up dependent on the fast processing of data drawn from the data warehouse, which prevents other users from running queries on the data they need for their analyses.

Modern architectures utilise a micro-service architecture, and we advocate this digital platform approach to delivering IT functionality (see our Digital Platform Playbook). Micro-services should own their own data – and as there is unlikely to be a one-size-fits-all solution to volumes, latencies, or use of master or reference data of the many critical business data flows implemented as micro-services. Great care should be taken as to which part of the analytics data pipelines they should be drawn from. The nearer the data they use is to the end users, the more constrained your data analytics pipeline will become over time, and the more restricted analytics users will become in what they can do.

If a micro-service is using a whole pipeline as part of its critical functionality, it is probably time to reproduce the pipeline as a micro-service in its own right, as the needs of the analytics users and the micro-service will diverge over time.

Include data users early on

We are sometimes asked if we can implement data pipelines without bothering data users. They are often very busy interfacing at senior levels, and as their work provides key inputs to critical business activities and decisions, it can be tempting to reduce the burden on them and think that you already understand their needs.

In our experience this is nearly always a mistake. Like any software development, understanding user needs as early as you can, and validating that understanding through the development, is much more likely to lead to a valued product. Data users almost always welcome a chance to talk about what data they want, what form they want it in, and how they want to access it. When it becomes available, they may well need some coaching on how to access it.

Keep unstructured raw inputs separate from processed data

In pipelines where the raw data is unstructured (e.g. documents or images), and the initial stages of the pipeline extract data from it, such as entities (names, dates, phone numbers, etc.), or categorisations, it can be tempting to keep the raw data together with the extracted information. This is usually a mistake. Unstructured data is always of a much higher volume, and keeping it together with extracted data will almost certainly lead to difficulties in processing or searching the useful, structured data later on. Keep the unstructured data in separate storage (e.g., different buckets), and store links to it instead.

We hope that this article, along with all the others in the series, will help you create better pipelines and address the common challenges that can occur when building and using them. Data pipeline projects can be challenging and complicated, but done correctly they securely gather information and allow you to make valuable decisions quickly and effectively. 

Contact us!

For more information on data pipelines in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of data pipelines with you, get in touch using the form below.

Managing the flow of information from a source to the destination system forms an integral part of every enterprise looking to generate value from their data.

Data and analytics are critical to business operations, so it’s important to engineer and deploy strong and maintainable data pipelines by following some essential practices.

This means there’s never been a better time to be a data engineer. According to DICE’s 2020 Tech Job Report, Data Engineer is the fastest-growing job in 2019, growing by 50% YoY. Data Scientist is also up there on the list, growing by 32% YoY.

But the parameters of the job are changing. Engineers now provide guidance on data strategy and pipeline optimisation and, as the sources and types of data become more complicated, engineers must know the latest practices to ensure increased profitability and growth. 

In our data pipeline playbook we have identified eleven practices to follow when creating a data pipeline. We touched on six of these practices in our last blog post. Now we talk about the other five, including iteratively creating your data models as well as observing the pipeline.  Applying these practices will allow you to integrate new data sources faster at a higher quality.

About this series

This is part five in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. Next we considered the “must have” key principles of data pipeline projects  and in part four, we looked at the six key practices needed for a data pipeline. Now we go into details of more of those practices, before finishing off our series in part six with a look at the many pitfalls you can encounter in a data pipeline project. 

Practice Seven: Observe the pipeline

Data sources can suddenly stop functioning for many reasons – unexpected changes to the format of the input data, an unanticipated rotation of secrets or change to access rights, or something happens in the middle of the pipeline that drops the data. This should be expected and means of observing the health of data flows should be implemented. Monitoring the data flows through the pipelines will help detect when failures have occurred and prevent adverse impacts. Useful tactics to apply include:

  • Measuring counts or other statistics of data going in and coming out at various points in the pipeline.
  • Implementing thresholds or anomaly detection on data volumes and alarms when they are triggered.
  • Viewing log graphs – use the shapes to tell you when data volumes have dropped unexpectedly.

Practice Eight: Data models are important and should be addressed iteratively

For data to be valuable to the end users (BI teams or data scientists), it has to be understandable at the point of use. In addition, analytics will almost always require the ability to merge data from sources. In our experience, many organisations do not suffer from big data as much as complex data – with many sources reporting similar or linked data – and a key challenge is to conform the data as a step before merging and aggregating it.

All these challenges require a shared understanding of data entities and fields – and need some kind of data model to resolve to.  If you ignore this data model at the start of the pipeline, you will have to address these needs later on.

However, we do not recommend the development of an enterprise data model before data can be ingested into the system. Rather, starting with the needs of the data users in the initial use cases will lead you to a useful data model that can be iterated and developed over time.

Practice Nine: Apply master data/reference data pragmatically to support merging

Most pipelines require data to be conformed not just to the schema but also against known entities such as organisational units, product lists, currencies, people, companies, and so forth. Ignoring this master data on ingestion will make it harder to merge data later on. However, master data management often becomes overwhelming and starts to seem as if the whole enterprise needs modelling. To avoid data analysis paralysis, we recommend starting from the initial use cases and iteratively building reference data and master data into the pipelines as they are needed.

Practice Ten: Use orchestration and workflow tools

Pipelines typically support complex data flows composed of several tasks. For all but the simplest pipelines, it is good practice to separate the dataflow from the code for the individual tasks. There are many tools that support this separation – usually in the form of Directed Acyclic Graphs (DAGs). In addition to supporting a clear isolate and reuse approach, and enabling continuous development through providing version control of the data flow, DAGs usually have a simple means of showing the data dependencies in a clear form, which is often useful in identifying bugs and optimising flows.

Depending on the environment and the nature and purpose of the pipeline, some tools we have found useful are:

  •   Apache Airflow
  •   dbt
  •   Argo Workflows
  •   DVC
  •   Dagster
  •   AWS Glue

Practice Eleven: Continuous testing

As with any continuous delivery development, a data pipeline needs to be continuously tested. However, data pipelines do face additional challenges such as:

  • There are typically many more dependencies such as databases, data stores and data transfers from external sources, all of which make pipelines more fragile than application software – the pipes can break in many places. Many of these dependencies are complex in themselves and difficult to mock out.
  • Even individual stages of a data pipeline can take a long time to process – anything with big data may well take hours to run. Feedback time and iteration cycles can be substantially longer.
  • In pipelines with Personally Identifiable Information (PII), PII data will only be available in the production environment. So how do you do your tests in development? You can use sample data which is PII-clean for development purposes. However, this will miss errors caused by unexpected data that is not in the development dataset, so you will also need to test within production environments – which can feel uncomfortable for many continuous delivery practitioners.
  • In a big data environment, it will not be possible to test everything – volumes of data can be so large that you cannot expect to test against all of it.

We have used a variety of testing practices to overcome these challenges:

  • The extensive use of integration tests – providing mock-ups of critical interfaces or using smaller-scale databases with known data to give quick feedback on schemas, dependencies and data validation.
  • Implementing ‘development’ pipelines in the production environment with isolated ‘development’ clusters and namespaces. This brings testing to the production data, avoiding PII issues, and sophisticated data replication/emulation across environments.
  • Statistics-based testing against sampled production data for smaller feedback loops on data quality checks.
  • Using infrastructure-as-code testing tools to test whether critical resources are in place and correct (see https://www.equalexperts.com/blog/our-thinking/testing-infrastructure-as-code-3-lessons-learnt/ for a discussion of some existing tools).

Hopefully this gives a clearer overview of some of the essential practices needed to create an effective data pipeline. In the next blog post in this series, we finish our series by looking at the many pitfalls you can encounter in a data pipeline project. Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  

Contact us!

If you’d like us to share our experience of data pipelines with you, get in touch using the form below.

A carefully managed data pipeline can provide you with seamless access to reliable and well-structured datasets.

A generalised form of transferring data from a source system A to a source system B, data pipelines are developed in small pieces, and integrated with data, logic and algorithms to perform complex transformations. To do this effectively, there are some essential practices that need to be adhered to.

In our data pipeline playbook we have identified eleven practices to follow when creating a data pipeline.  Here we touch on six of these practices such as how to start by using a steel thread, and in our next blog post we will talk about iteratively creating your data models as well as observing the pipeline.  Applying these practices will allow you to integrate new data sources faster at a higher quality as outlined in our recent post on the benefits of a data pipeline.

About this series

This is part four in our six-part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. In part three we considered the ‘must have’ key principles of data pipeline projects. Now we look at the six key practices needed for a data pipeline. Before we get into the details we just want to cover off what’s coming in the rest of the series. In part five we look at more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project. 

The growing need for good data engineering

Today, data engineers serve a wider audience than just a few years ago. As there is a growing need for organisations to apply machine learning techniques to their data, new challenges are faced by data engineers in order to remain relevant. Essential to every project is the ability to reliably deliver large-volume data sets so that data scientists can train more accurate models.

Aside from dealing with larger data volumes, these pipelines need to be flexible in order to accommodate the variety of data and the increasingly high processing velocity required. The following practices are those that we feel are essential to successful projects, the minimum requirement for success. They are based on our collective knowledge and experience gained across many data pipeline engagements.  

Practice 1: Build for the right latency

When designing the pipeline, it’s important to consider what level of latency you need. What is your speed of decision? How quickly do you need the data? Building and running a low latency, real-time data pipeline will be significantly more expensive, so make sure that you know you need one before embarking on that path. You should also ask how fast your pipeline can be. Is it even possible for you to have a real-time data pipeline? If all your data sources are produced by daily batch jobs, then the best latency you can reach will be daily updates, and the extra cost of real-time implementations will not provide any business benefits.

If you do need to be within real-time or near real-time, then this needs to be a key factor at each step of the pipeline. The speed of the pipe is conditioned by the speed of the slowest stage.

And be careful not to confuse the need for a real-time decision engine with the need for a real-time historical data store, such as a data warehouse for the data scientists. Decision models are created from stores of historical data and need to be validated before deployment into production. Model release usually takes place at a slower cadence (e.g., weekly or monthly). Of course, the deployed model will need to work on a live data stream, but we consider this part of the application development. This is not the appropriate use for a data warehouse or similar.

Practice 2: Keep raw data

Ingestions should start by storing raw data in the pipeline without making any changes. In most environments, data storage is cheap, and it is common to have all the ingested data persisted and unchanged. Typically, this is done via cloud file storage (S3, GCP Cloud Storage, Azure Storage), or HDFS for on-premise data.

Keeping this data allows you to reprocess it without re-ingestion if any business rule changes, and it also retains the possibility of new pipelines based on this data if, for example, a new dashboard is needed.

Practice 3: Break transformations into small tasks

Pipelines are usually composed of several transformations of the data, activities such as format validation, conformance against master data, enrichment, imputation of missing values, etc. Data pipelines are no different from other software and should thus follow modern software development practices of breaking down software units into small reproducible tasks. Each task should target a single output and be deterministic and idempotent. If we run a transformation on the same data multiple times, the results should always be the same.

By creating easily tested tasks, we increase the quality and confidence in the pipeline, as well as enhance the pipeline maintainability. If we need to add or change something on the transformation, we have the guarantee that if we rerun it, the only changes will be the ones we made.

Practice 4: Support backfilling

If the pipelines are mature at the start of development, it may not be possible to fully evaluate whether the pipeline is working correctly or not. Is this metric unusual because this is what always happens on Mondays, or is it a fault in the pipeline? We may well find at a later date that some of the ingested data was incorrect. Imagine you find out that during a month, a source was reporting incorrect results, but for the rest of the time, the data was correct.

We should engineer our pipelines so that we can correct them as our understanding of the dataflows matures. We should be able to backfill the stored data when we have identified a problem in the source or at some point in the pipeline, and ideally, it should be possible to backfill just for the corresponding period of time, leaving the data for other periods untouched.

Practice 5: Start with a steel thread

When starting at a greenfield site, we typically build up data pipelines iteratively around a steel thread – first a thin data pipe which is a thin slice through the architecture. This progressively validates the quality and security of the data. The first thread creates an initial point of value – probably a single data source, with some limited processing, stored where it can be accessed by at least one data user. The purpose of this first thread is to provide an initial path to data and uncover unexpected blockers, so it is selected for simplicity rather than having the highest end-user value. Bear in mind that in the first iteration, you will need to:

  • Create a cloud environment which meets the organisation’s information security needs.
  • Set up the continuous development environment.
  • Create an appropriate test framework.
  • Model the data and create the first schemas in a structured data store.
  • Coach end users on how to access the data.
  • Implement simple monitoring of the pipeline.

Later iterations will bring in more data sources and provide access to wider groups of users, as well as bringing in more complex functionality such as:

  • Including sources of reference or master data.
  • Advanced monitoring and alerting.

Practice 6: Utilise cloud – define your pipelines with infrastructure-as-code

Pipelines are a mixture of infrastructure (e.g., hosting services, databases, etc.), processing code, and scripting/configuration. They can be implemented using proprietary and/or open-source technologies. However, all of the cloud providers have excellent cloud native services for defining, operating and monitoring data pipelines. They are usually superior in terms of their ability to scale with increasing volumes, simpler to configure and operate, and support a more agile approach to data architecture.

Whichever solution is adopted, since pipelines are a mixture of components, it is critical to adopt an infrastructure-as-code approach. Only by having the pipeline defined and built using tools, such as terraform, and source controlled in a repository, will pipeline owners have control over the pipeline and the confidence to rebuild and refine it as needed.

Hopefully this gives a clearer overview of some of the essential practices needed to create an effective data pipeline. In the next blog post in this series, we will outline more of the practices needed for data pipelines.  Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  

Contact us!

If you’d like us to share our experience of data pipelines with you, get in touch using the form below.

 

Facing an ever-growing set of new tools and technologies, high functioning analytics teams have come to rely increasingly on data engineers. Building and managing production data engineering pipelines is an inherently complex process, which can prove hard to scale without a systematic approach.

To help navigate this complexity, we have compiled our top advice for successful solutions. Here we examine some of the key guiding principles to help data engineers (of all experience levels) effectively build and manage data pipelines. These have been compiled using the experience of the data engineers at Equal Experts. They collectively recommend the adoption of these principles as they will help you lay the foundation to create sustainable and enduring pipelines.  

About this series

This is part three in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. Now we consider the “must have” key principles of data pipeline projects. Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part four, we look at the six key practices needed for a data pipeline. In part five we investigate more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project. 

The growing need for good data engineering

If I have learned anything from my years working as a data engineer, it is that practically every data pipeline fails at some point. Broken connections, broken dependencies, data arriving too late, or unreachable external systems or APIs. There are many reasons. But, regardless of the cause, we can do a lot to mitigate the impact of a data pipeline’s failure. These ‘must have’ principles are built up over the years to help to ensure that projects are successful. They are based on my knowledge, and the Equal Experts team’s collective experience, gained across many data pipeline engagements.  

Data pipelines are products

Pipelines bring data from important business sources. In many cases, they feed reports and analyses that endure for a long time. Unless your business does not expect to alter how it operates, or there are no amendments to low-level processes, the data pipelines will always need to adapt to the changes in the fundamental processes, new IT, or the data itself.  As something that should respond to and embrace regular change, pipelines should be treated as products rather than projects.

This means that there should be multi-year funding to monitor and maintain the existing pipelines. Providing headroom to add new ones, and supporting the analysis and retirement of old ones. Pipelines need product managers to understand the pipelines’ current status and operability, and to prioritise the work. (See this Forbes article for a wider description of working in product-mode over project-mode.)

Find ways for making common use of the data

The data collected for a given problem or piece of analysis will nearly always be useful in answering other questions. When creating pipelines, we try to architect them in a way that allows reuse, whilst also remaining lean in our implementation choices.

In many cases there are simple ways of achieving this. For example, there are usually a variety of places where data is stored in the pipeline. Raw ingested data might be useful for unanticipated purposes. And it can often be made available to skilled users by providing them access to the landing zone.

Appropriate identity and access technologies, such as role-based access, can support reuse while permitting strict adherence to data-protection policies and regulations. The fundamental architecture can stay the same, with access being provided by adding or amending access roles and permissions to data buckets, databases or data warehouses.

A pipeline should operate as a well-defined unit of work

Pipelines have a cadence driven by the need for decision-making and limited by the availability of source data. The developers and users of a pipeline should understand and recognise this as a well-defined unit of work – whether every few seconds, hourly, daily, monthly or event-driven.

Pipelines should be built around use cases

In general, we recommend building pipelines around the use case rather than the data source. This will help ensure that business value is achieved early. In some cases, the same data source might be important to several use cases, each with different cadences and access rights. Understanding when to reuse parts of pipelines and when to create new ones is an important consideration. For example, faster pipelines can always be used for slower cadences, but it typically requires more effort to maintain and adapt them. It might be simpler to create a simpler batch pipeline to meet a new low-latency use case that is not expected to change substantially than to focus on upgrading a fast-streaming pipe to meet the new requirements. 

Continuously deliver your pipelines

We want to be able to amend our data pipelines in an agile fashion as the data environment and needs of the business change. So, just like any other piece of working software, continuous delivery practices should be adopted to enable continuous updates of data pipelines in production. Adopting this mindset and these practices is essential to support continuous improvement and create feedback loops that rapidly expose problems and address user feedback.

Consider how you name and partition your data

Data pipelines are a mix of code and infrastructure that can become confusing as they grow if care is not taken with the naming. Pipelines will include at least a set of databases, tables, attributes, buckets, roles, etc., and they should be named in a consistent way to facilitate understanding and maintenance of the pipelines, as well as make the data meaningful to the end-users.

In many architectures, naming will directly affect how your data is partitioned, which in turn affects the speed of the search and retrieval of data. Consider what will be the most frequent queries when specifying bucket names, table partitions, shards, and so on.

Want to know more?

These guiding principles have been born out of our engineers and use each of their 10+ years of data engineering for end-to-end machine learning solutions. We are sure there are lots of other principles, so please do let us know of any approaches you have found effective in managing data pipelines. 

In our next blog post in this series we will start laying out some of the key practices of data pipelines.  Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  

Contact us!

If you’d like us to share our experience of data pipelines with you, get in touch using the form below.