Scott Cutts, Lead Consultant

Our Thinking Thu 9th September, 2021

Understanding the role of data pipelines and data platforms in event-driven architecture

In an event-driven architecture, you’re empowered to use real-time triggers across your organisation to deliver value—or mitigate risk—in the moment. To do this effectively, you need the right information with the right people, immediately.

This is where data pipelines and data platforms play a crucial role.

The best way to understand and improve your business is to metricate it. In an event-driven architecture, these metrics offer two tiers of value:

  1. A system of record to outline actions that have occurred (which, over time, can lend themselves to big data processing, machine learning, artificial intelligence, and more)
  2. Time-sensitive insights you can use to improve efficiencies, create value for customers, or take proactive steps to mitigate risk as things unfold in the moment

To make the most of both of these potential benefits, you need to make information rapidly available so that people or systems can act on it.

Whether you want to provide a better customer experience or protect against fraud, you’ll likely benefit from implementing some form of data pipeline and data platform.

What is a data pipeline? And what is a data platform?

A data pipeline is the means by which you move information around your organisation, connecting data users with data sources. For example, how information travels to a data scientist from an event queue like RabbitMQ or Apache Kafka.

So, what’s a data user? ‘User’ is a term we typically associate with UX design, but the reality is that most organisations have a wide range of data users. These range from specialist roles like data scientists and business intelligence analysts, to general managers or teams who oversee a domain or series of domains within an organisation.

Then there are data sources. In the context of an event-driven architecture, these will likely be ‘listeners’: microservices configured to consume specific streams of information from an event queue. Find out more about microservices and event-driven architecture, compared with a monolith, here.
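To make that concrete, here’s a minimal sketch of such a listener, assuming the kafka-python client; the topic name, consumer group, and handler are invented for the example rather than a prescription:

```python
# A minimal event listener: a microservice that consumes one stream from
# the event backbone and hands each event on to the pipeline.
# Assumes the kafka-python package; topic name, consumer group, and
# handler below are invented for the example.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                           # illustrative topic name
    bootstrap_servers="localhost:9092",
    group_id="orders-pipeline-ingest",  # one consumer group per pipeline
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Hand the event to the next pipeline stage (transform, store, ...)
    print(f"Ingested event from partition {message.partition}: {event}")
```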

A data platform is the place where various data users from across the organisation can access the information generated by data sources in the ways that are meaningful for their specific analysis or action.

For example, if the relevant people need information about sales figures across the past year, you can grant them the appropriate permissions on the data platform. The platform will transform the information into a format that’s meaningful for that specific team and their unique use cases. For a sales team, that might mean aggregating figures into insights that suggest where new functionality could better support sales activity.
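As a rough illustration of that per-team shaping, the same raw sales events might be rolled up by region for the sales team and by month for finance. The event fields here are invented for the example:

```python
# Shaping the same raw events differently for two teams. The event
# fields (region, month, amount) are invented for this illustration.
from collections import defaultdict

raw_sales_events = [
    {"region": "north", "month": "2021-07", "amount": 1200.0},
    {"region": "north", "month": "2021-08", "amount": 950.0},
    {"region": "south", "month": "2021-08", "amount": 430.0},
]

# View for the sales team: totals per region.
sales_by_region = defaultdict(float)
for event in raw_sales_events:
    sales_by_region[event["region"]] += event["amount"]

# View for finance: totals per month.
sales_by_month = defaultdict(float)
for event in raw_sales_events:
    sales_by_month[event["month"]] += event["amount"]

print(dict(sales_by_region))  # {'north': 2150.0, 'south': 430.0}
print(dict(sales_by_month))   # {'2021-07': 1200.0, '2021-08': 1380.0}
```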

Another example of a data platform is our work with Her Majesty’s Revenue and Customs’ Customer Insight Platform (commonly referred to as CIP).

Fundamentally, CIP tracks a variety of user activities in a Multi-channel Digital Tax Platform and surfaces the data for different business domains—from customer service to risk analysis—with a view to predicting and preventing fraud in real time. You can read more about the CIP here.

Alternatively, you can learn more about data pipelines and platforms in our Data Pipelines Playbook.

What does a typical process flow involve?

Transporting data effectively typically involves the following steps and processes; a toy end-to-end version is sketched in code after the list.

  • Data sources: In an event-driven architecture, data sources will likely be events—aka real-world actions or unique business triggers—which will be fed into pipelines via an event backbone, using solutions like Apache Kafka or Azure Event Hubs.
  • Ingestion process: Ingestion refers to the means by which data is moved from the source into the pipeline. In an event-driven architecture, ingestion is implemented by a listener on the event stream.
  • Transformation: In most cases, data needs to be transformed from the input format of the raw data to the format in which it is stored. There may be several transformations in a pipeline.
  • Data quality control and cleansing: Data is checked for quality at various points in the pipeline. Quality checks will typically include at least validation of data types and formats, as well as conformance with master data.
  • Enrichment: Data items may be enriched by adding additional fields, such as reference data.
  • Storage: Data is stored at various points in the pipeline, typically ending in a structured store (such as a data warehouse).
  • Presentation of data to end users: Surfacing data is typically provisioned by a data platform.
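Here’s the toy version promised above: one record’s journey through ingestion, transformation, quality control, enrichment, and storage. The stage logic and reference data are invented for illustration:

```python
# A toy pipeline chaining the stages above: ingest -> transform ->
# quality control -> enrich -> store. Stage logic and reference data
# are invented for illustration.
REFERENCE_DATA = {"GB": "United Kingdom", "FR": "France"}  # master/reference data
warehouse = []  # stand-in for a structured store


def transform(raw):
    # Normalise the raw input format into the stored format.
    return {
        "order_id": str(raw["id"]),
        "country": raw["country"].upper(),
        "amount": float(raw["amount"]),
    }


def validate(record):
    # Data quality control: types, formats, conformance with master data.
    if record["amount"] < 0:
        raise ValueError("amount must be non-negative")
    if record["country"] not in REFERENCE_DATA:
        raise ValueError(f"unknown country code: {record['country']}")
    return record


def enrich(record):
    # Enrichment: add a field from reference data.
    return {**record, "country_name": REFERENCE_DATA[record["country"]]}


def ingest(raw):
    # One record's journey through the pipeline, ending in storage.
    warehouse.append(enrich(validate(transform(raw))))


ingest({"id": 42, "country": "gb", "amount": "19.99"})
print(warehouse)
# [{'order_id': '42', 'country': 'GB', 'amount': 19.99,
#   'country_name': 'United Kingdom'}]
```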

It’s worth noting that the above flow highlights an Extract-Load-Transform (ELT) process, rather than a traditional Extract-Transform-Load (ETL) approach.

The benefit associated with ELT is that you maintain a record of the raw data in its original form. If you later develop use cases or strategies that depend on that information but that were not apparent when you first extracted it, you can go back to the original data in its raw form and build on it.
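In miniature, ELT might look like this: the raw event is landed untouched, and transformations—including ones written long after the fact—replay over the stored records. The field names are invented for the example:

```python
# ELT in miniature: land the raw event untouched, transform on demand.
# The raw store is a list here; in practice it might be a data lake.
# Field names are invented for the example.
raw_store = []


def load(raw_event):
    raw_store.append(raw_event)  # keep the original form, unaltered


def transform_for_new_use_case(raw_event):
    # A transformation written long after the data was first loaded.
    return {"user": raw_event["user_id"], "spent": float(raw_event["amount"])}


load({"user_id": "u1", "amount": "9.50", "channel": "web"})
# Months later, a new use case can replay the untouched raw records.
print([transform_for_new_use_case(e) for e in raw_store])
# [{'user': 'u1', 'spent': 9.5}]
```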

What benefits do data pipelines and platforms offer in the context of an event-driven architecture?

If you’re operating in a low-latency environment—where data needs to be updated and visible in near-to-real-time—your architecture needs to be event-driven.

In these low latency event-driven environments, data pipelines provide immediate analytical insight.

In other words, data is used to create insights that facilitate real-time activity in response to certain triggers, as opposed to cultivating a system of record.

Let’s consider an example. Imagine a logistics or freight company moving packages across the city. The business will likely have data ingestion and pipelines that report its capacity to dispatch orders. In this example, the organisation will need to understand:

  • Which departments or team members have capacity to dispatch more orders?
  • Which departments are overwhelmed or at capacity?
  • How many orders are being processed at each location around the city?
  • Where should orders be funnelled for dispatch to maximise productivity?

Of course, the answers to these questions will fluctuate dramatically throughout the day.
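Here’s a sketch of what that might look like in code: each dispatch event adjusts a live picture of capacity the moment it arrives. Depot names, capacities, and event types are invented for the example:

```python
# Keeping a live capacity view as dispatch events arrive. Depot names,
# capacities, and event types are invented for the example.
CAPACITY = {"depot-east": 100, "depot-west": 60}
in_flight = {"depot-east": 0, "depot-west": 0}


def on_event(event):
    # Each event adjusts the live picture the moment it happens.
    delta = 1 if event["type"] == "order_assigned" else -1
    in_flight[event["depot"]] += delta


def depots_with_capacity():
    return [depot for depot, n in in_flight.items() if n < CAPACITY[depot]]


for event in [
    {"type": "order_assigned", "depot": "depot-east"},
    {"type": "order_assigned", "depot": "depot-west"},
    {"type": "order_dispatched", "depot": "depot-east"},
]:
    on_event(event)

print(depots_with_capacity())  # ['depot-east', 'depot-west']
```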

In order for low-latency information to be meaningful, it needs to be event-driven; it needs to be ingested and moved throughout the organisation via data pipelines; and it needs to be surfaced for relevant parties to monitor in meaningful ways through a data platform.

Common pitfalls to avoid when implementing a data pipeline or data platform with an event-driven architecture

There are a wide range of practices and pitfalls outlined in our Data Pipelines Playbook, which are equally applicable to both event-driven architectures and batch processing.

However, there is one pitfall that we see increasingly often: coupling your data pipeline into other business processes.

In certain instances, you might ingest data from a pipeline into a data warehouse and people begin to analyse the information. Then, they’ll start building services that use and rely on that flow of data.

As a result of this practice, you’ll start establishing services downstream of the data warehouse. This practice—in and of itself—is not necessarily problematic. But if it’s taken too far and you build too many interdependencies that rely on a certain stream of information from the data warehouse, you inevitably end up constrained. You can’t be adaptive or flexible in the way that a data warehouse should be, because:

  • Analytical users can’t access the data in the way they want or need to
  • Other users throughout the organisation can’t make ad-hoc queries they need to on the fly

To mitigate this issue, you might develop a data platform to serve specific information to unique teams.

The best way to ascertain whether a data platform will mitigate these issues is to ask yourself: how much are we asking of the data warehouse, and how dependent are we on it?

Looking for more information on event-driven architecture and data pipelines?

Take a look through the following resources for further details:
