Supporting Austrade to embrace world-class data practice

Last month, I had the great honour of being invited to speak at Austrade’s Data Champions conference, a quarterly convention for the Federal Government’s Australian Trade and Investment Commission.

Serving a wide range of stakeholders—both at home and internationally—the Australian Trade and Investment Commission (also known as Austrade) delivers services to grow Australia’s economic prosperity. It’s all about helping businesses go further, faster. Obviously, data plays a huge role in their approach to high quality service provision.

To support the team in their excellent work, I was happy to deliver a brief presentation on the importance and value of data pipelines.

This included some of the new ways we think about, and work with, data at Equal Experts. Given Austrade’s focus on ‘connecting Australian businesses with the world’, I also shared some of our learnings from embedding leading data practice through our ongoing collaborations with Her Majesty’s Revenue & Customs (HMRC), the United Kingdom’s equivalent of the Australian Taxation Office (ATO). Over a working partnership of many years, we’re proud to note that Equal Experts is one of the top five resource providers for HMRC. We’ve worked together on everything from cutting-edge fraud detection mechanisms to transitioning from physical infrastructure to the cloud.

In contrast, this event was singularly focused and all about data.

Here are some of the key points that seemed to resonate throughout the presentation.

1. Data can—and should—be agile.

Historically, and in many large-scale organisations around the world today, there’s a tendency to conceptualise and treat data in a certain way. In many instances, it’s a case of approaching data without any sense of fluidity or immediacy.

In the old world, data isn’t something to be dynamically accessed. It’s built up over time and then used to create reports retrospectively based on insights gleaned from the material. Additionally, there’s a prevailing conception that data is slow to establish; you need to build up expanses of information before any meaningful implementation.

In fact, the opposite is true. With the right approach, you can act much more fluidly and create real business value in real-time.

For example, we design and implement data pipelines with highly reusable patterns; this ensures organisations can rapidly create new data pipelines as use-cases or business requirements evolve. And they do, and should, evolve.

If something changes in your organisation—the source of data, the use case associated with the data being collected, the utility of how or why that data is important—then you need to evolve your pipeline(s) to reflect those developments. This is where the concept of agile data practice comes to the fore.

Continually reviewing data pipelines is a valuable practice that many organisations overlook. Data collection and collation isn’t a set-and-forget proposition, unless your organisation itself is comfortable in stasis.

We typically approach data practice through the lens of agile delivery, with practices and rituals such as:

  • Discovery and stakeholder engagement: Collect the necessary context for any data set by incorporating business drivers, a range of data sources, your current or desired capabilities, and the reality of your IT systems. This de-risks delivery by ensuring you have everything in place to hit the ground running: clear scope, visible dependencies, defined ways of working, and a delivery plan.
  • Iterative delivery: Short sprints with continuous feedback help deliver value rapidly and frequently throughout the process. Competitors talk about data in the context of delivering value by the 6-month mark; we prefer to deliver cyclical value every 3-4 weeks.
  • Continuous review: Regular review intervals enable stakeholders to continuously validate progress and decide when, and how, to release to end users.
  • Launch and refine: Collect feedback and refine things using a data-driven approach.

2. Keep unstructured raw inputs separate from any processed data streams.

In terms of prioritising and maintaining a level of flexibility, it’s in your best interest to keep unstructured raw inputs separate from any processed data streams. This ensures you minimise the requirement to develop new end-to-end pipelines for new use cases. You simply draw data from the existing unstructured data for specific requirements as they become apparent.

By following this practice, you can keep your data flexible, agile, and easy-to-update. Which, in turn, facilitates far more value—often in real-time—from the information you cultivate.

3. Build your technical infrastructure around your business infrastructure: start with the use-case.

This relates to another crucial practice outlined in our Data Pipelines Playbook. It’s essential that you think of data pipelines as products, not projects. This means your data pipeline should have a product owner: someone who can prioritise deliverables and assist in defining use cases.

These use cases are critical. Effective data practice always starts with the use-case, rather than the technical implementation. A technical architecture must be driven by a business architecture, which should include the actual environment of the organisation in question.

You simply cannot define a high performance, highly effective technical architecture without that fundamental context of business or organisational requirements. And business requirements are often defined by the use-cases of the end users of the system, and the data that it generates.

Without detailed understanding of those use-cases, how do you calibrate and measure the efficacy of your solution?

If you’re ready to embed leading data practice at the core of your organisation, let’s tee up a conversation.

Alternatively, take a look through some of our other pieces on data pipelines:

As a follow-up from Language Agnostic Data Pipelines, the following post is focused on the use of dbt (data build tool).

dbt is a command-line tool that enables us to transform the data inside a data warehouse by writing SQL select statements which represent the models. There is also a paid version with a web interface, dbt Cloud, but for this article let’s consider just the command-line tool.

The intent of this article is not to provide a tutorial on dbt (that already exists here), nor one on TDD. The goal is to illustrate how one of our software development practices, test-driven development, can be used to develop dbt models.

Testing strategies in dbt

dbt has two types of tests:

  • Schema tests: Applied in YAML, these return the number of records that do not pass an assertion. When this number is 0, all records pass and therefore your test passes.
  • Data tests: Specific queries that pass when they return 0 records.

Both types of test can be used against staging/production data to detect data quality issues.

The second type of test gives us more freedom to write data quality tests. These tests run against a data warehouse loaded with data. They can run on production, on staging, or for instance against a test environment where a sample of data was loaded. These tests can be tied to a data pipeline so they can continuously test the ingested and transformed data.

Using dbt data tests to compare model results

With a little bit of SQL creativity, the data tests (SQL selects) can be naively* used to test model transformations, comparing the result of a model with a set of expectations:

with expectations as (
   select 'value' as column1
   union all
   select 'value 2' as column1
)

select * from expectations
except
select * from analytics.a_model

The query returns results when the expectations differ from the model output, in which case dbt reports a test failure. However, this approach isn’t effective for testing models, for the following reasons:

  • The test input is shared among all the tests (this could be overcome by executing dbt test and the data setup for each test, although it’s not practical due to the lack of clarity and the maintainability of test suites).
  • The test input is not located inside the test itself, so tests are awkward to write and the goal of each test is hard to understand.
  • The dbt test output doesn’t show the differences between the expectations and the actual values, which slows down the development.
  • For each test, we need to have a boilerplate query with the previous format (with expectations as…).

Considering these drawbacks, it doesn’t seem like the right tool for testing model transformations.

A strategy to introduce a kind of ‘data unit tests’

It’s possible and common to combine SQL with the templating engine Jinja (https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros). It’s also possible to define macros which can be used to extend dbt’s functionality. With that in mind, let’s introduce the following macro:

unit_test(table_name, input, expectations)

The macro receives:

  • A table name (or a view name).
  • An input value that contains a set of inserts.
  • A table of expectations.

To illustrate the usage of the macro, here is our last test case refactored:

{% set table_name = ref('a_model') %}

{% set input %}
insert into a_table(column1) values ('value'), ('value 2');
{% endset %}

{% set expectations %}
select 'value' as column1
union all
select 'value 2' as column1
{% endset %}

{{ unit_test(table_name, input, expectations) }}

There is some boilerplate when using Jinja to declare the variables needed to call the unit test macro. Even so, it seems a good trade-off, because this strategy enables us to:

  • Simplify the test query boilerplate.
  • Set up input data in plain SQL and in the same file.
  • Set up expectations in plain SQL and in the same file.
  • Run each test segregated from other tests.
  • Show differences when a test fails.
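
The macro itself is not reproduced in this post, so the following is only a rough Python analogue of the same idea, not the authors' implementation. It assumes SQLite as a stand-in warehouse, and the `unit_test` function, the upper-casing model and the table contents are all hypothetical. It shows the three properties listed above: input and expectations live next to the test, each test runs in its own isolated database, and a failure reports the rows that differ.

```python
import sqlite3
from collections import Counter


def unit_test(setup_sql, model_sql, expectations_sql):
    """Run one isolated test and return the rows that differ.

    Each call uses a fresh in-memory database, so tests cannot share input.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)  # create and load the input tables
        actual = Counter(conn.execute(model_sql).fetchall())
        expected = Counter(conn.execute(expectations_sql).fetchall())
    finally:
        conn.close()
    # Counter subtraction is a multiset difference, so duplicates count too.
    return {
        "missing": sorted((expected - actual).elements()),
        "unexpected": sorted((actual - expected).elements()),
    }


# Hypothetical model under test: upper-cases column1 of a_table.
SETUP = """
create table a_table (column1 text);
insert into a_table (column1) values ('value'), ('value 2');
"""
MODEL = "select upper(column1) as column1 from a_table"
EXPECTATIONS = "select 'VALUE' as column1 union all select 'VALUE 2' as column1"

result = unit_test(SETUP, MODEL, EXPECTATIONS)
print(result)
```

An empty `missing` and `unexpected` list means the test passed; anything else names the exact rows to investigate.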



The previous macro will be available in the repo published with the Language Agnostic Data Pipelines.

*Naively coded, because the use of EXCEPT between both tables fails to detect duplicate rows. It could be fixed easily but, for illustrative purposes, we preferred to keep the example as simple as possible.
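
As a small illustration of that caveat, the sketch below runs the naive query against SQLite (an assumption made purely so the example is runnable; the table is called `a_model` without the `analytics` schema for the same reason). The model contains a duplicated row that EXCEPT does not notice. One possible fix, which is our suggestion rather than anything from the original macro, is to compare per-value row counts in both directions.

```python
import sqlite3

EXPECTATIONS = "select 'value' as column1 union all select 'value 2' as column1"

# The naive test from the article: set difference ignores duplicates.
NAIVE_TEST = f"""
with expectations as ({EXPECTATIONS})
select * from expectations
except
select * from a_model
"""

# Compare row counts per distinct value, in both directions, so that
# missing, unexpected and duplicated rows all surface as differences.
COUNTING_TEST = f"""
with expectations as ({EXPECTATIONS}),
expected_counts as (
    select column1, count(*) as n from expectations group by column1
),
actual_counts as (
    select column1, count(*) as n from a_model group by column1
)
select 'expected' as side, column1, n
from (select * from expected_counts except select * from actual_counts)
union all
select 'actual' as side, column1, n
from (select * from actual_counts except select * from expected_counts)
"""


def failing_rows(conn, test_sql):
    """Return the rows a data test selects; an empty list means it passes."""
    return conn.execute(test_sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table a_model (column1 text)")
conn.executemany(
    "insert into a_model (column1) values (?)",
    [("value",), ("value 2",), ("value 2",)],  # 'value 2' is duplicated
)

naive_failures = failing_rows(conn, NAIVE_TEST)
counting_failures = failing_rows(conn, COUNTING_TEST)
print(naive_failures, counting_failures)
```

The naive test passes even though the model output is wrong, while the counting version reports both the expected and the actual count for the duplicated value.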

Bringing software engineering practices to the data world

It is also easy to apply other standard software development practices in dbt, such as integration with a CI/CD environment. This is one of the advantages of using it over transforming data inside ETL tools which use a visual programming approach.

Wrapping up, we advocate that data-oriented projects should always use well-known software engineering best practices. We hope that this article shows how you can apply TDD using the emerging dbt data transformation tool.

Pedro Sousa paired on this journey with me. He is taking the journey from software engineering to data engineering in our current project, and he helped on the blog post.

Contact us!

For more information on data pipelines in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of data pipelines with you, get in touch using the form below.

Managing the flow of information from a source to the destination system forms an integral part of every enterprise looking to generate value from their data.

Data and analytics are critical to business operations, so it’s important to engineer and deploy strong and maintainable data pipelines by following some essential practices.

This means there’s never been a better time to be a data engineer. According to DICE’s 2020 Tech Job Report, Data Engineer was the fastest-growing job in 2019, growing by 50% YoY. Data Scientist was also high on the list, growing by 32% YoY.

But the parameters of the job are changing. Engineers now provide guidance on data strategy and pipeline optimisation and, as the sources and types of data become more complicated, engineers must know the latest practices to ensure increased profitability and growth. 

In our data pipeline playbook we have identified eleven practices to follow when creating a data pipeline. We touched on six of these practices in our last blog post. Now we look at the other five, including iteratively creating your data models and observing the pipeline. Applying these practices will allow you to integrate new data sources faster and at a higher quality.

About this series

This is part five in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. Next we considered the “must have” key principles of data pipeline projects  and in part four, we looked at the six key practices needed for a data pipeline. Now we go into details of more of those practices, before finishing off our series in part six with a look at the many pitfalls you can encounter in a data pipeline project. 

Practice Seven: Observe the pipeline

Data sources can suddenly stop functioning for many reasons: unexpected changes to the format of the input data, an unanticipated rotation of secrets or change to access rights, or a failure in the middle of the pipeline that drops the data. This should be expected, and a means of observing the health of data flows should be implemented. Monitoring the data flows through the pipelines will help detect when failures have occurred and prevent adverse impacts. Useful tactics to apply include:

  • Measuring counts or other statistics of data going in and coming out at various points in the pipeline.
  • Implementing thresholds or anomaly detection on data volumes and alarms when they are triggered.
  • Viewing log graphs – use the shapes to tell you when data volumes have dropped unexpectedly.
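
The first two tactics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `StageCount` structure, the stage names and the 5% drop threshold are all hypothetical, and a real pipeline would send these alerts to whatever monitoring system is already in place.

```python
from dataclasses import dataclass


@dataclass
class StageCount:
    """Row counts measured going into and coming out of one pipeline stage."""
    stage: str
    rows_in: int
    rows_out: int


def find_volume_alerts(counts, max_drop_ratio=0.05):
    """Return an alert message for every stage that looks unhealthy."""
    alerts = []
    for count in counts:
        if count.rows_in == 0:
            # A silent source is itself a signal worth alerting on.
            alerts.append(f"{count.stage}: no input rows received")
            continue
        dropped = (count.rows_in - count.rows_out) / count.rows_in
        if dropped > max_drop_ratio:
            alerts.append(f"{count.stage}: dropped {dropped:.0%} of rows")
    return alerts


counts = [
    StageCount("validate", rows_in=1000, rows_out=990),
    StageCount("enrich", rows_in=990, rows_out=700),
    StageCount("ingest_partner_feed", rows_in=0, rows_out=0),
]
alerts = find_volume_alerts(counts)
for alert in alerts:
    print(alert)
```

A fixed threshold is the simplest starting point; anomaly detection against historical volumes can replace it once enough history has been collected.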

Practice Eight: Data models are important and should be addressed iteratively

For data to be valuable to the end users (BI teams or data scientists), it has to be understandable at the point of use. In addition, analytics will almost always require the ability to merge data from sources. In our experience, many organisations do not suffer from big data as much as complex data – with many sources reporting similar or linked data – and a key challenge is to conform the data as a step before merging and aggregating it.

All these challenges require a shared understanding of data entities and fields – and need some kind of data model to resolve to.  If you ignore this data model at the start of the pipeline, you will have to address these needs later on.

However, we do not recommend the development of an enterprise data model before data can be ingested into the system. Rather, starting with the needs of the data users in the initial use cases will lead you to a useful data model that can be iterated and developed over time.

Practice Nine: Apply master data/reference data pragmatically to support merging

Most pipelines require data to be conformed not just to the schema but also against known entities such as organisational units, product lists, currencies, people, companies, and so forth. Ignoring this master data on ingestion will make it harder to merge data later on. However, master data management often becomes overwhelming and starts to seem as if the whole enterprise needs modelling. To avoid data analysis paralysis, we recommend starting from the initial use cases and iteratively building reference data and master data into the pipelines as they are needed.
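
As a hedged illustration of what pragmatic use of reference data can look like, the sketch below conforms a currency field against a small lookup table. The `CURRENCY_REFERENCE` mapping, the record shape and the `conform_currency` function are hypothetical. Records that do not match are set aside rather than dropped, so the reference data can grow iteratively as new values appear.

```python
# Hypothetical reference data: raw labels seen so far -> canonical currency code.
CURRENCY_REFERENCE = {
    "gbp": "GBP",
    "pound sterling": "GBP",
    "aud": "AUD",
    "a$": "AUD",
}


def conform_currency(records, reference):
    """Split records into those conformed to a known currency and the rest."""
    conformed, unmatched = [], []
    for record in records:
        key = record["currency"].strip().lower()
        code = reference.get(key)
        if code is None:
            # Keep the record for review; it may justify a new reference entry.
            unmatched.append(record)
        else:
            conformed.append({**record, "currency": code})
    return conformed, unmatched


records = [
    {"id": 1, "currency": " GBP "},
    {"id": 2, "currency": "A$"},
    {"id": 3, "currency": "Euro"},
]
conformed, unmatched = conform_currency(records, CURRENCY_REFERENCE)
print(conformed)
print(unmatched)
```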

Practice Ten: Use orchestration and workflow tools

Pipelines typically support complex data flows composed of several tasks. For all but the simplest pipelines, it is good practice to separate the dataflow from the code for the individual tasks. There are many tools that support this separation, usually in the form of Directed Acyclic Graphs (DAGs). In addition to supporting a clear isolate-and-reuse approach and enabling continuous development through version control of the data flow, these tools usually have a simple means of showing the data dependencies in a clear form, which is often useful in identifying bugs and optimising flows.

Depending on the environment and the nature and purpose of the pipeline, some tools we have found useful are:

  •   Apache Airflow
  •   dbt
  •   Argo Workflows
  •   DVC
  •   Dagster
  •   AWS Glue

Practice Eleven: Continuous testing

As with any continuous delivery development, a data pipeline needs to be continuously tested. However, data pipelines do face additional challenges such as:

  • There are typically many more dependencies such as databases, data stores and data transfers from external sources, all of which make pipelines more fragile than application software – the pipes can break in many places. Many of these dependencies are complex in themselves and difficult to mock out.
  • Even individual stages of a data pipeline can take a long time to process – anything with big data may well take hours to run. Feedback time and iteration cycles can be substantially longer.
  • In pipelines with Personally Identifiable Information (PII), PII data will only be available in the production environment. So how do you do your tests in development? You can use sample data which is PII-clean for development purposes. However, this will miss errors caused by unexpected data that is not in the development dataset, so you will also need to test within production environments – which can feel uncomfortable for many continuous delivery practitioners.
  • In a big data environment, it will not be possible to test everything – volumes of data can be so large that you cannot expect to test against all of it.

We have used a variety of testing practices to overcome these challenges:

  • The extensive use of integration tests – providing mock-ups of critical interfaces or using smaller-scale databases with known data to give quick feedback on schemas, dependencies and data validation.
  • Implementing ‘development’ pipelines in the production environment with isolated ‘development’ clusters and namespaces. This brings testing to the production data, avoiding both PII issues and the need for sophisticated data replication/emulation across environments.
  • Statistics-based testing against sampled production data for shorter feedback loops on data quality checks.
  • Using infrastructure-as-code testing tools to test whether critical resources are in place and correct (see https://www.equalexperts.com/blog/our-thinking/testing-infrastructure-as-code-3-lessons-learnt/ for a discussion of some existing tools).
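
As one possible reading of statistics-based testing on sampled data, the sketch below estimates the share of missing values in a field from a random sample and compares it with a threshold. The function names, the 1,000-row sample size and the threshold are assumptions chosen for illustration, not recommendations from the playbook.

```python
import random


def sample_null_rate(rows, field, sample_size, seed=0):
    """Estimate the fraction of rows where `field` is missing, from a sample."""
    if not rows:
        raise ValueError("cannot sample from an empty dataset")
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    missing = sum(1 for row in sample if row.get(field) is None)
    return missing / len(sample)


def check_null_rate(rows, field, max_rate, sample_size=1000, seed=0):
    """Return (passed, observed_rate) for a simple data quality check."""
    rate = sample_null_rate(rows, field, sample_size, seed)
    return rate <= max_rate, rate


# Hypothetical data: 10 of 100 rows have no email address.
rows = [{"id": i, "email": None if i < 10 else f"user{i}@example.com"} for i in range(100)]
passed, rate = check_null_rate(rows, "email", max_rate=0.2)
print(passed, rate)
```

Sampling keeps the feedback loop short on large tables, at the cost of only catching problems that are frequent enough to show up in the sample.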

Hopefully this gives a clearer overview of some of the essential practices needed to create an effective data pipeline. In the next and final post in this series, we look at the many pitfalls you can encounter in a data pipeline project. Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.


A carefully managed data pipeline can provide you with seamless access to reliable and well-structured datasets.

A generalised form of transferring data from a source system A to a target system B, data pipelines are developed in small pieces, and integrated with data, logic and algorithms to perform complex transformations. To do this effectively, there are some essential practices that need to be adhered to.

In our data pipeline playbook we have identified eleven practices to follow when creating a data pipeline.  Here we touch on six of these practices such as how to start by using a steel thread, and in our next blog post we will talk about iteratively creating your data models as well as observing the pipeline.  Applying these practices will allow you to integrate new data sources faster at a higher quality as outlined in our recent post on the benefits of a data pipeline.

About this series

This is part four in our six-part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. In part three we considered the ‘must have’ key principles of data pipeline projects. Now we look at the six key practices needed for a data pipeline. Before we get into the details we just want to cover off what’s coming in the rest of the series. In part five we look at more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project. 

The growing need for good data engineering

Today, data engineers serve a wider audience than just a few years ago. As there is a growing need for organisations to apply machine learning techniques to their data, new challenges are faced by data engineers in order to remain relevant. Essential to every project is the ability to reliably deliver large-volume data sets so that data scientists can train more accurate models.

Aside from dealing with larger data volumes, these pipelines need to be flexible in order to accommodate the variety of data and the increasingly high processing velocity required. The following practices are those that we feel are essential to successful projects, the minimum requirement for success. They are based on our collective knowledge and experience gained across many data pipeline engagements.  

Practice 1: Build for the right latency

When designing the pipeline, it’s important to consider what level of latency you need. What is your speed of decision? How quickly do you need the data? Building and running a low latency, real-time data pipeline will be significantly more expensive, so make sure that you know you need one before embarking on that path. You should also ask how fast your pipeline can be. Is it even possible for you to have a real-time data pipeline? If all your data sources are produced by daily batch jobs, then the best latency you can reach will be daily updates, and the extra cost of real-time implementations will not provide any business benefits.

If you do need to be within real-time or near real-time, then this needs to be a key factor at each step of the pipeline. The speed of the pipe is conditioned by the speed of the slowest stage.
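
A quick way to sanity-check the achievable latency is to write down how often each stage can deliver fresh data and look at the slowest one. The sketch below does exactly that; the stage names and intervals are made up for illustration.

```python
from datetime import timedelta


def effective_refresh_interval(stage_intervals):
    """Return the slowest stage and its interval.

    Output cannot be fresher than the stage that refreshes least often.
    """
    slowest = max(stage_intervals, key=stage_intervals.get)
    return slowest, stage_intervals[slowest]


# Hypothetical pipeline: a daily source export feeding fast downstream stages.
stages = {
    "source export": timedelta(days=1),
    "ingestion": timedelta(minutes=5),
    "transformation": timedelta(seconds=30),
}
bottleneck, interval = effective_refresh_interval(stages)
print(f"Best achievable refresh: every {interval} (limited by {bottleneck})")
```

In this example, investing in a streaming ingestion layer would not make the data any fresher until the daily export itself changes.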

And be careful not to confuse the need for a real-time decision engine with the need for a real-time historical data store, such as a data warehouse for the data scientists. Decision models are created from stores of historical data and need to be validated before deployment into production. Model release usually takes place at a slower cadence (e.g., weekly or monthly). Of course, the deployed model will need to work on a live data stream, but we consider this part of the application development. This is not the appropriate use for a data warehouse or similar.

Practice 2: Keep raw data

Ingestions should start by storing raw data in the pipeline without making any changes. In most environments, data storage is cheap, and it is common to have all the ingested data persisted and unchanged. Typically, this is done via cloud file storage (S3, GCP Cloud Storage, Azure Storage), or HDFS for on-premise data.

Keeping this data allows you to reprocess it without re-ingestion if any business rule changes, and it also retains the possibility of new pipelines based on this data if, for example, a new dashboard is needed.
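
The idea can be sketched with the local filesystem standing in for cloud file storage. Everything here is an assumption made for illustration: the `land_raw` and `reprocess` helpers, the content-hash file naming and the CSV payload. The point is that the payload is written byte-for-byte, and that new transformations can be run over it later without going back to the source.

```python
import hashlib
import tempfile
from pathlib import Path


def land_raw(landing_dir, source, payload):
    """Persist the payload unchanged; the content hash makes re-landing a no-op."""
    digest = hashlib.sha256(payload).hexdigest()
    target = landing_dir / source / f"{digest}.raw"
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.write_bytes(payload)
    return target


def reprocess(landing_dir, source, transform):
    """Apply a (possibly new) transformation to every stored raw file."""
    files = sorted((landing_dir / source).glob("*.raw"))
    return [transform(path.read_bytes()) for path in files]


def count_rows(raw):
    return len(raw.decode().splitlines()) - 1  # minus the header line


def total_amount(raw):
    lines = raw.decode().splitlines()[1:]
    return sum(int(line.split(",")[1]) for line in lines)


with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp)
    land_raw(landing, "orders", b"id,amount\n1,10\n2,5\n")
    land_raw(landing, "orders", b"id,amount\n1,10\n2,5\n")  # same file again
    row_counts = reprocess(landing, "orders", count_rows)
    # A new business question arrives later: no re-ingestion needed.
    totals = reprocess(landing, "orders", total_amount)

print(row_counts, totals)
```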

Practice 3: Break transformations into small tasks

Pipelines are usually composed of several transformations of the data, activities such as format validation, conformance against master data, enrichment, imputation of missing values, etc. Data pipelines are no different from other software and should thus follow modern software development practices of breaking down software units into small reproducible tasks. Each task should target a single output and be deterministic and idempotent. If we run a transformation on the same data multiple times, the results should always be the same.

By creating easily tested tasks, we increase the quality and confidence in the pipeline, as well as enhance the pipeline maintainability. If we need to add or change something on the transformation, we have the guarantee that if we rerun it, the only changes will be the ones we made.
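
A minimal sketch of a deterministic, idempotent task is shown below. The order records, the `vat_rate` parameter and the in-memory `store` are hypothetical. The transformation is a pure function of its input, and the write step replaces rows by key instead of appending, so rerunning the task leaves the output unchanged.

```python
def enrich_orders(orders, vat_rate=0.2):
    """Pure transformation: the same input always produces the same output."""
    return {
        order_id: {**order, "gross": round(order["net"] * (1 + vat_rate), 2)}
        for order_id, order in orders.items()
    }


def write_output(store, enriched):
    """Upsert by key rather than append, so a rerun overwrites, not duplicates."""
    store.update(enriched)
    return store


orders = {"o1": {"net": 100.0}, "o2": {"net": 50.0}}
store = {}

write_output(store, enrich_orders(orders))
first_run = dict(store)

write_output(store, enrich_orders(orders))  # rerun on the same data
second_run = dict(store)

print(first_run == second_run, len(store))
```

An append-only write would have produced four rows after the second run, which is the kind of silent duplication that idempotent tasks are meant to rule out.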

Practice 4: Support backfilling

While the pipelines are still immature, as they usually are at the start of development, it may not be possible to fully evaluate whether the pipeline is working correctly or not. Is this metric unusual because this is what always happens on Mondays, or is it a fault in the pipeline? We may well find at a later date that some of the ingested data was incorrect. Imagine you find out that for one month a source was reporting incorrect results, but for the rest of the time the data was correct.

We should engineer our pipelines so that we can correct them as our understanding of the dataflows matures. We should be able to backfill the stored data when we have identified a problem in the source or at some point in the pipeline, and ideally, it should be possible to backfill just for the corresponding period of time, leaving the data for other periods untouched.
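
One common way to make this possible is to partition stored data by period and recompute only the affected partitions. The sketch below assumes daily partitions held in plain dictionaries; the `backfill` helper and the sample values are hypothetical.

```python
from datetime import date, timedelta


def days_between(start, end):
    """Yield every day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(store, raw_by_day, transform, start, end):
    """Recompute only the partitions in [start, end]; return the days rewritten."""
    rewritten = []
    for day in days_between(start, end):
        if day in raw_by_day:
            store[day] = transform(raw_by_day[day])
            rewritten.append(day)
    return rewritten


raw_by_day = {
    date(2024, 3, 1): [1, 2],
    date(2024, 3, 2): [3, 4],
    date(2024, 3, 3): [5, 6],
}
store = {day: sum(values) for day, values in raw_by_day.items()}

# The source later reports that the data for 2 March was wrong and resends it.
raw_by_day[date(2024, 3, 2)] = [30, 4]
rewritten = backfill(store, raw_by_day, sum, date(2024, 3, 2), date(2024, 3, 2))

print(rewritten, store)
```

Only the partition for 2 March is rewritten; the values for the surrounding days are left untouched.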

Practice 5: Start with a steel thread

When starting at a greenfield site, we typically build up data pipelines iteratively around a steel thread: first a thin data pipe that forms a narrow slice through the whole architecture. Building out from this thread progressively validates the quality and security of the data. The first thread creates an initial point of value, probably a single data source, with some limited processing, stored where it can be accessed by at least one data user. The purpose of this first thread is to provide an initial path to data and uncover unexpected blockers, so it is selected for simplicity rather than for having the highest end-user value. Bear in mind that in the first iteration, you will need to:

  • Create a cloud environment which meets the organisation’s information security needs.
  • Set up the continuous development environment.
  • Create an appropriate test framework.
  • Model the data and create the first schemas in a structured data store.
  • Coach end users on how to access the data.
  • Implement simple monitoring of the pipeline.

Later iterations will bring in more data sources and provide access to wider groups of users, as well as bringing in more complex functionality such as:

  • Including sources of reference or master data.
  • Advanced monitoring and alerting.

Practice 6: Utilise cloud – define your pipelines with infrastructure-as-code

Pipelines are a mixture of infrastructure (e.g., hosting services, databases, etc.), processing code, and scripting/configuration. They can be implemented using proprietary and/or open-source technologies. However, all of the cloud providers have excellent cloud native services for defining, operating and monitoring data pipelines. They are usually superior in terms of their ability to scale with increasing volumes, simpler to configure and operate, and support a more agile approach to data architecture.

Whichever solution is adopted, since pipelines are a mixture of components, it is critical to adopt an infrastructure-as-code approach. Only when the pipeline is defined and built using tools such as Terraform, and source-controlled in a repository, will pipeline owners have control over the pipeline and the confidence to rebuild and refine it as needed.

Hopefully this gives a clearer overview of some of the essential practices needed to create an effective data pipeline. In the next blog post in this series, we will outline more of the practices needed for data pipelines.  Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  


Facing an ever-growing set of new tools and technologies, high functioning analytics teams have come to rely increasingly on data engineers. Building and managing production data engineering pipelines is an inherently complex process, which can prove hard to scale without a systematic approach.

To help navigate this complexity, we have compiled our top advice for successful solutions. Here we examine some of the key guiding principles to help data engineers (of all experience levels) effectively build and manage data pipelines. These have been compiled using the experience of the data engineers at Equal Experts. They collectively recommend the adoption of these principles as they will help you lay the foundation to create sustainable and enduring pipelines.  

About this series

This is part three in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. Now we consider the “must have” key principles of data pipeline projects. Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part four, we look at the six key practices needed for a data pipeline. In part five we investigate more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project. 

The growing need for good data engineering

If I have learned anything from my years working as a data engineer, it is that practically every data pipeline fails at some point. Broken connections, broken dependencies, data arriving too late, or unreachable external systems or APIs: there are many reasons. But, regardless of the cause, we can do a lot to mitigate the impact of a data pipeline’s failure. These ‘must have’ principles have been built up over the years to help ensure that projects are successful. They are based on my knowledge, and the Equal Experts team’s collective experience, gained across many data pipeline engagements.

Data pipelines are products

Pipelines bring data from important business sources. In many cases, they feed reports and analyses that endure for a long time. Unless your business does not expect to alter how it operates, or there are no amendments to low-level processes, the data pipelines will always need to adapt to the changes in the fundamental processes, new IT, or the data itself.  As something that should respond to and embrace regular change, pipelines should be treated as products rather than projects.

This means that there should be multi-year funding to monitor and maintain the existing pipelines, provide headroom to add new ones, and support the analysis and retirement of old ones. Pipelines need product managers to understand the pipelines’ current status and operability, and to prioritise the work. (See this Forbes article for a wider description of working in product-mode over project-mode.)

Find ways for making common use of the data

The data collected for a given problem or piece of analysis will nearly always be useful in answering other questions. When creating pipelines, we try to architect them in a way that allows reuse, whilst also remaining lean in our implementation choices.

In many cases there are simple ways of achieving this. For example, there are usually a variety of places where data is stored in the pipeline. Raw ingested data might be useful for unanticipated purposes. And it can often be made available to skilled users by providing them access to the landing zone.

Appropriate identity and access technologies, such as role-based access, can support reuse while permitting strict adherence to data-protection policies and regulations. The fundamental architecture can stay the same, with access being provided by adding or amending access roles and permissions to data buckets, databases or data warehouses.

A pipeline should operate as a well-defined unit of work

Pipelines have a cadence driven by the need for decision-making and limited by the availability of source data. The developers and users of a pipeline should understand and recognise this as a well-defined unit of work – whether every few seconds, hourly, daily, monthly or event-driven.

Pipelines should be built around use cases

In general, we recommend building pipelines around the use case rather than the data source. This will help ensure that business value is achieved early. In some cases, the same data source might be important to several use cases, each with different cadences and access rights. Understanding when to reuse parts of pipelines and when to create new ones is an important consideration. For example, faster pipelines can always be used for slower cadences, but it typically requires more effort to maintain and adapt them. It might be simpler to create a simple batch pipeline to meet a new slower-cadence use case that is not expected to change substantially than to focus on upgrading a fast-streaming pipe to meet the new requirements.

Continuously deliver your pipelines

We want to be able to amend our data pipelines in an agile fashion as the data environment and needs of the business change. So, just like any other piece of working software, continuous delivery practices should be adopted to enable continuous updates of data pipelines in production. Adopting this mindset and these practices is essential to support continuous improvement and create feedback loops that rapidly expose problems and address user feedback.

Consider how you name and partition your data

Data pipelines are a mix of code and infrastructure that can become confusing as they grow if care is not taken with the naming. Pipelines will include at least a set of databases, tables, attributes, buckets, roles, etc., and they should be named in a consistent way to facilitate understanding and maintenance of the pipelines, as well as make the data meaningful to the end-users.

In many architectures, naming will directly affect how your data is partitioned, which in turn affects the speed of the search and retrieval of data. Consider what will be the most frequent queries when specifying bucket names, table partitions, shards, and so on.
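
To make this concrete, here is a small sketch of a consistent naming scheme that also determines partitioning. It assumes date-based queries are the most frequent ones and uses the widely used Hive-style `key=value` folder convention. The function name and the domain/dataset layout are our own illustrative choices, not a prescribed standard.

```python
from datetime import date


def partition_key(domain: str, dataset: str, day: date, filename: str) -> str:
    """Build an object key with Hive-style date partitions (year=/month=/day=)."""
    return (
        f"{domain}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


key = partition_key("sales", "orders", date(2021, 3, 7), "part-0001.parquet")
print(key)  # sales/orders/year=2021/month=03/day=07/part-0001.parquet
```

With keys like this, a query for a single day only needs to read one folder. If most queries filtered by customer or region instead, a different partition column would be the better choice.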

Want to know more?

These guiding principles have been born out of our engineers’ experience, drawing on each of their 10+ years of data engineering for end-to-end machine learning solutions. We are sure there are lots of other principles, so please do let us know of any approaches you have found effective in managing data pipelines.

In our next blog post in this series we will start laying out some of the key practices of data pipelines.  Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  

Contact us!

If you’d like us to share our experience of data pipelines with you, get in touch using the form below.

The six main benefits of an effective data pipeline

When you think of the technology tools that power a successful business, a data pipeline isn’t always at the top of the list. Although most forward-thinking companies now realise data is one of their most valuable assets, the importance of data engineering is often underestimated.

Yet modern data pipelines enable your business to quickly and efficiently unlock the data within your organisation. They allow you to extract information from its source, transform it into a usable form, and load it into your systems where you can use it to make insightful decisions. Do it well and you will benefit from faster innovation, higher quality (with improved reliability), reduced costs, and happy people. Do it badly, and you could lose a great deal of money, miss vital information or gain completely incorrect information.

In this article we look at how a successful data pipeline can help your organisation, as we attempt to unpack and understand the benefits of data pipelines.

About this series

This is part two in our six-part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Now we look at the six main benefits of an effective data pipeline. Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part three we consider the ‘must have’ key principles of data pipeline projects, and parts four and five cover the essential practices of a data pipeline. Finally, in part six we look at the many pitfalls you can encounter in a data pipeline project.

The benefits of a great data pipeline

Simply speaking, a data pipeline is a series of steps that move raw data from a source to a destination. In the context of business intelligence, a source could be a transactional database. The destination is where the data is analysed for business insights. In this journey from the source to the destination, transformation logic is applied to data to make it ready for analysis. There are many benefits to this process. Here are our top six.

1 – Replicable patterns
Understanding data processing as a network of pipelines creates a way of thinking that sees individual pipes as examples of patterns in a wider architecture, which can be reused and repurposed for new data flows.

2 – Faster timeline for integrating new data sources
Having a shared understanding and tools for how data should flow through analytics systems makes it easier to plan for the ingestion of new data sources, and reduces the time and cost for their integration.

3 – Confidence in data quality

Thinking of your data flows as pipelines that need to be monitored, and that need to be meaningful to end users, improves the quality of the data and reduces the likelihood of breaks in the pipeline going undetected.

4 – Confidence in the security of the pipeline

Security is built in from the first pipeline by having repeatable patterns and a shared understanding of tools and architectures. Good security practices can be readily reused for new dataflows or data sources.

5 – Incremental build
Thinking about your dataflows as pipelines enables you to grow your dataflows incrementally. By starting with a small manageable slice from a data source to a user, you can start early and gain value quickly.

6 – Flexibility and agility
Pipelines provide a framework where you can respond flexibly to changes in the sources or your data users’ needs.

Designing extensible, modular, reusable data pipelines is a larger topic and very relevant in data engineering. In the next blog post in this series, we will outline the principles of data pipelines. Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.

It is common to hear that ‘data is the new oil,’ and whether you agree or not, there is certainly a lot of untapped value in much of the data that organisations hold.

Data is like oil in another way – it flows through pipelines. A data pipeline ensures the efficient flow of data from one location to the other. A good pipeline allows your organisation to integrate new data sources faster, provides patterns that you can replicate, gives you confidence in your data quality, and builds in security. But data flow can be precarious and, when not given the correct attention, it can quickly overwhelm your organisation. Data can leak, become corrupted, and hit bottlenecks. As the complexity of the requirements grows, and the number of data sources multiplies, these problems increase in scale and impact.

About this series

This is part one in our six-part series on the data pipeline, taken from our latest playbook. Here we look at the very basics – what is a data pipeline and who is it used by? Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part two, we look at the six main benefits of a good data pipeline, part three considers the ‘must have’ key principles of data pipeline projects, and parts four and five cover the essential practices of a data pipeline. Finally, in part six we look at the many pitfalls you can encounter in a data pipeline project.

Why is a data pipeline critical to your organisation?

There is a lot of untapped value in the data that your organisation holds, and that data is critical if you take data analysis seriously. Put to good use, data can identify valuable business insights on your customers and your operations. However, to find these insights, the data has to be regularly, or even continuously, transported from the place where it is generated to a place where it can be analysed.

A data pipeline consolidates data from all your disparate sources into one or more destinations, to enable quick data analysis. It also ensures consistent data quality, which is absolutely crucial for reliable business insights.

So what is a data pipeline?

A data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. We like to think of this transportation as a pipeline because data goes in at one end and comes out at another location (or several others). The volume and speed of the data are limited by the type of pipe you are using and pipes can leak – meaning you can lose data if you don’t take care of them.

The data engineers who create a pipeline provide a critical service for any organisation. They create the architectures that allow the data to flow to the data scientists and business intelligence teams, who generate the insight that leads to business value.

A data pipeline is created for data analytics purposes and has:

Data sources – These can be internal or external and may be structured (e.g., the result of a database call), semi-structured (e.g., a CSV file or a Google Sheets file), or unstructured (e.g., text documents or images).

Ingestion process – This is the means by which data is moved from the source into the pipeline (e.g., API call, secure file transfer).

Transformations – In most cases, data needs to be transformed from the input format of the raw data, to the one in which it is stored. There may be several transformations in a pipeline.

Data quality/cleansing – Data is checked for quality at various points in the pipeline. Data quality will typically include at least validation of data types and format, as well as conforming with the master data.

Enrichment – Data items may be enriched by adding additional fields, such as reference data.

Storage – Data is stored at various points in the pipeline, usually at least the landing zone and a structured store (such as a data warehouse).

End users – more information on this is in the next section.
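
The components above can be sketched end to end in a few lines of Python. This is a deliberately tiny illustration under stated assumptions, not a reference implementation: the CSV content, the field names, the COUNTRY_NAMES reference data and every function name are made up for the example, and in-memory lists stand in for the landing zone and the structured store.

```python
import csv
import io

# Hypothetical raw input, standing in for a file that arrives from a source.
RAW_CSV = "order_id,country_code,amount\n1,AU,10.50\n2,GB,not-a-number\n3,AU,4.00\n"
# Hypothetical reference data used for validation and enrichment.
COUNTRY_NAMES = {"AU": "Australia", "GB": "United Kingdom"}


def ingest(raw: str) -> list[dict]:
    """Ingestion: read the raw data into the landing zone unchanged."""
    return list(csv.DictReader(io.StringIO(raw)))


def is_valid(row: dict) -> bool:
    """Data quality: check the type of amount and conformance to reference data."""
    try:
        float(row["amount"])
    except ValueError:
        return False
    return row["country_code"] in COUNTRY_NAMES


def transform(row: dict) -> dict:
    """Transformation: convert raw strings into typed values."""
    return {
        "order_id": int(row["order_id"]),
        "country_code": row["country_code"],
        "amount": float(row["amount"]),
    }


def enrich(row: dict) -> dict:
    """Enrichment: add a field looked up from reference data."""
    return {**row, "country": COUNTRY_NAMES[row["country_code"]]}


def run_pipeline(raw: str) -> tuple[list[dict], list[dict]]:
    """Run all stages; return (rows for the structured store, rejected rows)."""
    landed = ingest(raw)
    stored, rejected = [], []
    for row in landed:
        if is_valid(row):
            stored.append(enrich(transform(row)))
        else:
            rejected.append(row)  # kept so that data is not silently lost
    return stored, rejected


stored, rejected = run_pipeline(RAW_CSV)
print(len(stored), "rows stored,", len(rejected), "rejected")
```

Keeping the rejected rows, rather than dropping them, is one simple way of making sure the pipe does not leak without anyone noticing.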

So, who uses a data pipeline?

We believe that, as in any software development project, a pipeline will only be successful if you understand the needs of the users. 

Not everyone uses data in the same way. For a data pipeline, the users are typically:

Business intelligence/management information analysts, who need data to create reports.

Data scientists, who need data to do an in-depth analysis of point problems or create algorithms for key business processes (we use ‘data scientist’ in the broadest sense, including credit risk analysts, website analytics experts, etc.).

Process owners, who need to monitor how their processes are performing and troubleshoot when there are problems.

Data users are skilled at visualising and telling stories with data, identifying patterns, or understanding significance in data. Often they have strong statistical or mathematical backgrounds. And, in most cases, they are accustomed to having data provided in a structured form – ideally denormalised – so that it is easy to understand the meaning of an individual row of data without the need to query separate tables or databases.
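
As a small illustration of what ‘denormalised’ means here, the sketch below joins a hypothetical orders table with a hypothetical customers table so that each output row can be read on its own. All names and values are invented for the example.

```python
# Normalised source data: orders refer to customers by id only.
customers = {
    101: {"name": "Acme Pty Ltd", "segment": "Enterprise"},
    102: {"name": "Bluegum Cafe", "segment": "Small business"},
}
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 250.0},
    {"order_id": 2, "customer_id": 102, "amount": 18.5},
]


def denormalise(orders: list[dict], customers: dict[int, dict]) -> list[dict]:
    """Copy the customer attributes onto every order row."""
    return [
        {
            **order,
            "customer_name": customers[order["customer_id"]]["name"],
            "customer_segment": customers[order["customer_id"]]["segment"],
        }
        for order in orders
    ]


flat_orders = denormalise(orders, customers)
for row in flat_orders:
    print(row)
```

Each row now carries the customer name and segment, so an analyst can interpret it without querying a second table.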

Is a data pipeline a platform?

Every organisation would benefit from a place where they can collect and analyse data from different parts of the business. Historically, this has often been met by a data platform, a centralised data store where useful data is collected and made available to approved people. 

But, whether they like it or not, most organisations are, in fact, a dynamic mesh of data connections which need to be continually maintained and updated. Following a single platform pattern often leads to a central data engineering team tasked with implementing data flows. 

The complexities of meeting everyone’s needs and ensuring appropriate information governance, as well as a lack of self-service, often make it hard to ingest new data sources. This can then lead to backlog buildup, frustrated data users, and frustrated data engineers. 

Thinking of these dataflows as a pipeline changes the mindset away from monolithic solutions, to a more decentralised way of thinking – understanding what pipes and data stores you need and implementing them the right way for that case whilst reusing where appropriate.

So now we have understood a little more about the data pipeline, what it is and how it works, we can start to understand the benefits and assess whether they align with your digital strategy.  We cover these in the next blog article, ‘What are the benefits of data pipelines?’

For more information on the data pipeline in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of the data pipeline with you, get in touch using the form below.

If you’re a senior IT leader, I’d like to make a prediction. You have faced a key data governance challenge at some time. Probably quite recently. In fact, there is a good chance that you’re facing one right now. I know this to be true, because clients approach us frequently with this exact issue.

However, it’s not a single issue. In fact, over time we have come to realise that data governance is a slippery term that means different things to different people. Which is why we felt that deeper investigation into the subject was needed, to gain clarity and understanding around this overloaded term and to establish how we can talk to clients who see data governance as a challenge.

So, what is data governance? And what motivates an organisation to be interested in it?

Through a series of surveys, discussions and our own experiences, we have come to the conclusion that client interest in data governance is motivated by the following wide range of reasons.

1. Data Security/Privacy

I want to be confident that I know the right measures are in place to secure my data assets and that we have the right protections in place.

2. Compliance – To meet industry requirements

I have specific regulations to meet (e.g. health, insurance, finance) such as:

  • Storage – I need to store specific data items for specified periods of time (or I can only store for specific periods of time).
  • Audit – I need to provide access to specified data for audit purposes.
  • Data lineage/traceability – I have to be able to show where my data came from or why a decision was reached.
  • Non-repudiation – I have to be able to demonstrate that the data has not been tampered with.

3. Data quality

My data is often of poor quality: it is missing data points, the values are often wrong or out of date, and now no-one trusts it. This is often seen in the context of central data teams charged with providing data to business functions such as operations, marketing etc. Sometimes data stewardship is mentioned as a means of addressing this.

4. Master/Reference Data Management

When I look at data about the same entities in different systems I get different answers.

5. Preparing my data for AI and automation

I am using machine learning and/or AI and I need to know why decisions are being made (as regulations around the use of AI and ML mature this is becoming more pressing – see for example https://ico.org.uk/for-organisations/guide-to-data-protection/key-data-protection-themes/explaining-decisions-made-with-ai/).

6. Data Access/Discovery

I want to make it easier for people to find data or re-use data – it’s difficult for our people to find and/or access data which would improve our business. I want to overcome my data silos. I want data consumers to be able to query data catalogues to find what they need.

7. Data Management

I want to know what data we have e.g. by compiling data dictionaries. I want more consistency about how we name data items. I want to employ schema management and versioning.

8. Data Strategy

I want to know what strategy I should take so my organisation can make better decisions using data. And how do I quantify the benefits?

9. Creating a data-driven organisation

I want to create an operating model so that my business can manage and gain value from its data.

I think it’s clear from this that there are many concerns covered by the term data governance. You probably recognise one, or maybe even several, as your own. So what do you need to do to overcome these? Well, now we understand the variety of concerns, we can start to address the approach to a solution. 

Understanding Lean Data Governance

Whilst it can be tempting for clients to look for an off-the-shelf solution to meet their needs, in reality, they are too varied to be met by a single product. Especially as many of the concerns are integral to the data architecture. Take data lineage and quality as examples that need to be considered as you implement your data pipelines – you can’t easily bolt them on as an afterthought.

Here at Equal Experts, we advocate taking a lean approach to data governance – identify what you are trying to achieve and implement the measures needed to achieve it.

The truth is, a large proportion of the concerns raised above can be met by following good practices when constructing and operating data architectures – the sorts of practices that are outlined in our Data Pipeline and Secure Delivery playbooks.  

We have found that good data governance emerges by applying these practices as part of delivery. For example:

  • Most data security concerns can be met by proven approaches – taking care during environment provisioning, implementing role-based access control, implementing access monitoring and alerts, and following the principle that security is continuous and collaborative.
  • Many data quality issues can be addressed by implementing the right measures in your data pipelines – incorporating observability throughout the pipelines, so that you can detect when changes happen in data flows, and/or pragmatically applying master and reference data so that there is consistency in data outputs.
  • Challenges with data access and overcoming data silos are improved by constructing data pipelines with an architecture that supports wider access. For example, our reference architecture includes data warehouses for storing curated data as well as landing zones which can be opened up to enable self-service for power data users. Many data warehouses include data cataloguing or data discovery tools to improve sharing.
  • Compliance challenges are often primarily about data access and security (which we have just addressed above) or data retention, which depends on your pipelines.
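
To give a flavour of the data quality point, here is a minimal sketch of a check that could run on each batch passing through a pipeline. The function name, field names and threshold are hypothetical, and in practice you would feed the results into whatever monitoring and alerting tooling you already use.

```python
def quality_report(rows: list[dict], required_fields: list[str], min_rows: int) -> list[str]:
    """Return a list of human-readable issues found in one batch of rows."""
    issues = []
    # A sudden drop in volume is often the first sign of a broken upstream feed.
    if len(rows) < min_rows:
        issues.append(f"expected at least {min_rows} rows, got {len(rows)}")
    # Count missing values (None or empty string) for each required field.
    for field_name in required_fields:
        missing = sum(1 for row in rows if row.get(field_name) in (None, ""))
        if missing:
            issues.append(f"{field_name}: {missing} missing value(s)")
    return issues


batch = [
    {"customer_id": 101, "email": "a@example.com"},
    {"customer_id": 102, "email": ""},
    {"customer_id": None, "email": "c@example.com"},
]
for issue in quality_report(batch, ["customer_id", "email"], min_rows=5):
    print(issue)
```

An empty list means the batch passed. Anything else can be logged or raised as an alert, so that a change in the data flow is noticed before the users notice it.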

Of course, it is important that implementing these practices is given sufficient priority during the delivery. And it is critical that product owners and delivery leads ensure that they remain in focus. The tasks that lead to good Data Governance can get lost when faced with excessive demands for additional user features. In our experience this is a mistake, as deprioritising governance activities will lead to drops in data quality, resulting in a loss of trust in the data and in the end will significantly affect the user experience.

Is Data Governance the same as Information Governance?

Sometimes we also hear the term Information Governance. Information Governance usually refers to the legal framework around data. It defines what data needs to be protected and any processes (e.g. data audits), compliance activities or organisational structures that need to be in place. GDPR is an Information Governance requirement – it specifies what everyone’s legal obligations are in respect of the data they hold, but it does not specify how to meet those obligations. Equal Experts does not create information governance policies, although we work with client information governance teams to design and implement the means to meet them.

The field of data governance is inherently complex, but I hope through this article you’ve been able to glean insights and understand some of the core tenets driving our approach. 

These insights and much more are in our Data Pipeline and Secure Delivery playbooks. And, of course, we are keen to hear what you think Data Governance means. So please feel free to get in touch with your questions, comments or additions on the form below.