Supporting Austrade to embrace world-class data practice

Last month, I had the great honour of being invited to speak at Austrade’s Data Champions conference, a quarterly convention for the Federal Government’s Australian Trade and Investment Commission.

Serving a wide range of stakeholders—both at home and internationally—the Australian Trade and Investment Commission (also known as Austrade) delivers services to grow Australia’s economic prosperity. It’s all about helping businesses go further, faster. Obviously, data plays a huge role in their approach to high quality service provision.

To support the team in their excellent work, I was happy to deliver a brief presentation on the importance and value of data pipelines.

This included some of the new ways we think about, and work with, data at Equal Experts. Plus—given Austrade’s focus on ‘connecting Australian businesses with the world’—some of our learnings from embedding leading data practice through our ongoing collaborations with Her Majesty’s Revenue & Customs (HMRC), the United Kingdom’s equivalent of the Australian Taxation Office (ATO). Over a working partnership of many years, we’re proud to note that Equal Experts is one of the top five resource providers for HMRC. We’ve worked together on everything from cutting-edge fraud detection mechanisms to transitioning from physical infrastructure to the cloud.

In contrast to that broad scope of work, this event was singularly focused on data.

Here are some of the key points that seemed to resonate throughout the presentation.

1. Data can—and should—be agile.

Historically, and in many large-scale organisations around the world today, there’s a tendency to treat data as something static, approached without any sense of fluidity or immediacy.

In the old world, data isn’t something to be dynamically accessed. It’s built up over time and then used to create reports retrospectively based on insights gleaned from the material. Additionally, there’s a prevailing conception that data is slow to establish; you need to build up expanses of information before any meaningful implementation.

In fact, the opposite is true. With the right approach, you can act much more fluidly and create real business value in real-time.

For example, we design and implement data pipelines with highly reusable patterns; this ensures organisations can rapidly create new data pipelines as use-cases or business requirements evolve. And they do, and should, evolve.

If something changes in your organisation—the source of data, the use case associated with the data being collected, the utility of how or why that data is important—then you need to evolve your pipeline(s) to reflect those developments. This is where the concept of agile data practice comes to the fore.

Continually reviewing data pipelines is a valuable practice that many organisations overlook. Data collection and collation isn’t a set-and-forget proposition, unless your organisation itself is comfortable in stasis.

We typically approach data practice through the lens of agile delivery, with practices and rituals such as:

  • Discovery and stakeholder engagement: Collect the necessary context for any data set by incorporating business drivers, a range of data sources, your current or desired capabilities, and the reality of your IT systems. This de-risks delivery by ensuring you have everything in place to hit the ground running: clear scope, visible dependencies, defined ways of working, and a delivery plan.
  • Iterative delivery: Short sprints with continuous feedback help deliver value rapidly and frequently throughout the process. Competitors talk about data in the context of delivering value by the 6-month mark; we prefer to deliver cyclical value every 3-4 weeks.
  • Continuous review: Regular review intervals enable stakeholders to continuously validate progress and decide when, and how, to release to end users.
  • Launch and refine: Collect feedback and refine things using a data-driven approach.

2. Keep unstructured raw inputs separate from any processed data streams.

In terms of prioritising and maintaining a level of flexibility, it’s in your best interest to keep unstructured raw inputs separate from any processed data streams. This minimises the need to develop new end-to-end pipelines for new use cases: you simply draw on the existing raw data as specific requirements become apparent.

By following this practice, you can keep your data flexible, agile, and easy-to-update. Which, in turn, facilitates far more value—often in real-time—from the information you cultivate.
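As a rough illustration of this separation, consider a warehouse with a raw landing area and a curated area. The schema and column names below are hypothetical and the exact SQL varies by warehouse; this is a minimal sketch of the pattern, not a prescribed implementation.

-- Raw inputs land unmodified, exactly as received from the source.
create table raw.events_landing (
    received_at   timestamp,
    source_system varchar,
    payload       varchar    -- the raw record, kept as-is (e.g. a JSON string)
);

-- New use cases are served by curated models built on top of the raw layer,
-- rather than by building a new end-to-end pipeline each time.
create view curated.daily_event_counts as
select
    cast(received_at as date) as event_date,
    source_system,
    count(*)                  as events
from raw.events_landing
group by 1, 2;

When a new requirement appears, you add another curated model over the same raw layer, leaving the ingestion untouched.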

3. Build your technical infrastructure around your business infrastructure: start with the use-case.

This relates to another crucial practice outlined in our Data Pipelines Playbook. It’s essential that you think of data pipelines as products, not projects. This means your data pipeline should have a product owner: someone who can prioritise deliverables and assist in defining use cases.

These use cases are critical. Effective data practice always starts with the use-case, rather than the technical implementation. A technical architecture must be driven by a business architecture, which should include the actual environment of the organisation in question.

You simply cannot define a high performance, highly effective technical architecture without that fundamental context of business or organisational requirements. And business requirements are often defined by the use-cases of the end users of the system, and the data that it generates.

Without detailed understanding of those use-cases, how do you calibrate and measure the efficacy of your solution?

If you’re ready to embed leading data practice at the core of your organisation, let’s tee up a conversation.

Alternatively, take a look through some of our other pieces on data pipelines:

As we emerge from the pandemic, for many businesses the biggest concern isn’t being too bold – it’s being too cautious.

Business leaders are looking to accelerate transformation and deliver ambitious new services that are invariably delivered through technology. IT leaders are in the hot seat, and that’s a worry if you’re not 100% confident in your data.

Can you guarantee that data quality meets requirements? Do you have the systems and skills to integrate data from multiple platforms, silos and applications? Can you track where data comes from, and how it is processed at each stage of the journey?

If not, you’ve got a data governance problem.  

Without strong, high-quality governance, organisations are at the mercy of inaccurate, insufficient and out-of-date information. That puts you at risk of making poor decisions that lead to lost business opportunities, reputational damage and reduced profits – and that’s just for starters.

What does high-quality data governance look like?

It’s likely that the IT department will own data governance, but the strategy must be mapped to wider business goals and priorities.

As a rough guideline, here are nine key things that we think must be a part of an effective data governance strategy:

  1. Data security/privacy: Do we have the right measures in place to secure data assets?
  2. Compliance: Are we meeting industry and statutory requirements in areas such as storage, audit, data lineage and non-repudiation?
  3. Data quality: Do we have a system in place to identify data that is poor quality, such as missing data points, incorrect values or out-of-date information? Is such information corrected efficiently, to maintain trust in our data?
  4. Master/reference data management: If I look at data in different systems, do I see different answers?
  5. Readiness for AI/automation: If we are using machine learning or AI, do I know why decisions are being made (in line with regulations around AI/ML)?
  6. Data access/discovery: Are we making it easier for people to find and reuse data? Can data consumers query data catalogues to find information, or do we need to find ways to make this easier?
  7. Data management: Do we have a clear overview of the data assets we have? This might require the creation of data dictionaries and schema that allow for consistent naming of data items and versioning.
  8. Data strategy: What business and transformation strategy does our data support? How does this impact the sort of decisions we make?
  9. Creating a data-driven organisation: Do we need to create an operating model so the business can manage – and gain value from – this data?

Moving from data policy to data governance

As we can see, data governance is about more than simply having an IT policy that covers the collection, storage and retention of data. Effective, high-level data governance needs to ensure that data is supporting the broader business strategy and can be accessed and relied upon to support timely and accurate decision-making.

So how do IT leaders start to move away from the first view of governance to the second?

While it can be tempting for organisations to buy an off-the-shelf solution for data governance, it’s unlikely to meet your needs, and may not align with your strategic goals.

Understanding your strategy first means the business can partner with IT to identify the architecture changes that might be needed, and then identify solutions that will meet these needs.

Understanding Lean Data Governance

Here at Equal Experts, we advocate taking a lean approach to data governance – identify what you are trying to achieve and implement the measures needed to meet those goals.

The truth is that a large proportion of the concerns raised above can be met by following good practices when constructing and operating data architectures. You’ll find more information about best practices in our Data Pipeline and Secure Delivery playbooks.

The quality of data governance can be improved by applying these practices. For example:

  • It’s possible to address data security concerns using proven approaches such as careful environment provisioning, role-based access control and access monitoring (see the sketch after this list).
  • Many data quality issues can be resolved by implementing the correct measures in data pipelines, such as incorporating observability so that you can see if changes happen in data flows, and pragmatically applying master and reference data so that there is consistency in data outputs.
  • To improve data access and overcome data silos, organisations should construct data pipelines with an architecture that supports wider access.
  • Compliance issues are often related to data access and security, or data retention. Good implementation in these areas makes achieving compliance much more straightforward.
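To make the role-based access control point concrete, here is a minimal sketch in warehouse-style SQL. The role and schema names (analyst, data_engineer, pipeline_service, raw, curated) are hypothetical, and the exact GRANT syntax varies between databases; treat it as an illustration of the pattern rather than a definitive implementation.

-- Hypothetical roles: analysts read curated data, engineers can inspect the raw
-- landing zone, and only the pipeline's service role can write curated tables.
create role analyst;
create role data_engineer;
create role pipeline_service;

grant usage on schema curated to analyst;
grant select on all tables in schema curated to analyst;

grant usage on schema raw to data_engineer;
grant select on all tables in schema raw to data_engineer;

grant usage on schema curated to pipeline_service;
grant insert, update, delete on all tables in schema curated to pipeline_service;

Because access is expressed as roles rather than per-user grants, widening or tightening access later is a matter of amending role membership, not re-architecting the pipeline.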

The field of data governance is inherently complex, but I hope through this article you’ve been able to glean insights and understand some of the core tenets driving our approach.

These insights and much more are in our Data Pipeline and Secure Delivery playbooks. And, of course, we are keen to hear what you think Data Governance means to your organisation. So please feel free to get in touch with your questions, comments or additions on the form below.

Data is the lifeblood of the modern organisation, providing the insights we rely on for everything from digital strategy to customer service priorities. But what happens when your data isn’t delivering as expected? Poor data quality, visibility or analytics can derail or delay critical business activities and needs to be addressed. 

Equal Experts offers a data health check service to help clients deliver better, more reliable data from their IT systems. We can audit your current data processes and provide practical guidance on what changes are needed to improve data processes. 

What is a data health check?  

The data health check is a service that Equal Experts offers to clients to help them understand and diagnose challenges they are experiencing around data. 

We’ll spend around two weeks with your team to understand what you want to achieve with data, and what improvements you can make to meet your data goals. We’ll spend time talking about your current systems and processes, and performing checks on key systems. 

At the end of this process, you’ll receive a report that contains specific, actionable recommendations. We’ll outline short and longer-term changes that you can make to improve your situation and meet your data goals. 

When do I need a data health check? 

A data health check can help if your organisation spots a sudden, unexpected change in its data, or if you are struggling with an ongoing issue related to the speed, quality or accuracy of data analytics in your business.

We can also help with acute data quality and performance issues, when they arise. We recently worked with an organisation that was tracking conversions in one part of their business, when the number of transactions suddenly fell, without any obvious cause. 

Our experts were able to identify an interoperability issue that was allowing some data to get ‘lost’ during the tracking process. We advised the client on how to address the immediate problem, and how to improve its data visibility and alerts processes to avoid similar issues happening in future. 

What does a data health check report look like? 

The data health check is specific to your organisation, systems and processes, so it will look different to every customer. However, we will always provide a comprehensive report to clients at the end of each data health check. 

This report will generally examine your vision (what you want to achieve), the symptoms (things that are causing problems),  causes (root causes such as technology or processes), and remedies (our recommended solutions). 

The recommendations made by our team might be very specific, such as rebuilding a particular table, or partitioning data sets differently. They can also be more strategic, for example, advising on a new data visualisation platform or BI dashboard. 

If appropriate, we can provide deep dive reports, which focus on one specific issue or area of concern. 

What happens during a health check? 

During the health check, our team will interview key stakeholders such as data users, engineers, architects and programme managers. We’ll also look at backlogs, databases, code repositories and any other information that is relevant to the health check process.

Throughout the process you can expect regular feedback, and access to a collaborative Miro board, where you can see the latest findings and insights about your data quality. 

If you are interested to know more about performing a data health check and what it could mean for your organisation, feel free to get in touch on the form below.

As a follow-up from Language Agnostic Data Pipelines, the following post is focused on the use of dbt (data build tool).

Dbt is a command-line tool that enables us to transform the data inside a Data Warehouse by writing SQL select statements which represent the models. There is also a paid version with a web interface, dbt cloud, but for this article let’s consider just the command-line tool.

The intent of this article is not to provide a tutorial about dbt – that already exists here – nor one about TDD. The goal is to illustrate how one of our software development practices, test-driven development, can be used to develop dbt models.

Testing strategies in dbt

Dbt has two types of tests:

  • Schema tests: Defined in YAML, these return the number of records that do not pass an assertion — when this number is 0, all records pass and therefore your test passes (a minimal example follows this list).
  • Data tests: Specific SQL queries that must return 0 records to pass.
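For context, here is a minimal schema test sketch for the a_model model used later in this article. The column name and the tests chosen (not_null, unique) are illustrative assumptions rather than part of the original project.

version: 2

models:
  - name: a_model
    columns:
      - name: column1
        tests:
          - not_null   # every row must have a value for column1
          - unique     # no duplicate values allowed

Running dbt test executes both the schema tests defined in YAML and any data tests in the tests directory.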

Both tests can be used against staging/production data to detect data quality issues.

The second type of test gives us more freedom to write data quality tests. These tests run against a data warehouse loaded with data. They can run on production, on staging, or for instance against a test environment where a sample of data was loaded. These tests can be tied to a data pipeline so they can continuously test the ingested and transformed data.

Using dbt data tests to compare model results

With a little bit of SQL creativity, the data tests (SQL selects) can be naively* used to test model transformations, comparing the result of a model with a set of expectations:

with expectations as (
    select 'value' as column1
    union all
    select 'value 2' as column1
)

select * from expectations
except
select * from analytics.a_model
The query returns rows when the model output differs from the expectations, in which case dbt reports a test failure. However, this methodology isn’t an effective way to test the models, for the following reasons:

  • The test input is shared among all the tests (this could be overcome by running the data setup and dbt test separately for each test, but that isn’t practical and hurts the clarity and maintainability of the test suite).
  • The test input is not located inside the test itself, so it’s neither user friendly to write nor easy to understand the goal of each test.
  • The dbt test output doesn’t show the differences between the expectations and the actual values, which slows down the development.
  • For each test, we need to have a boilerplate query with the previous format (with expectations as…).

Considering these drawbacks, it doesn’t seem like the right tool for testing model transformations.

A strategy to introduce a kind of ‘data unit tests’

It’s possible and common to combine SQL with the templating engine Jinja (https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros). It’s also possible to define macros which can be used to extend dbt’s functionality. That being said, let’s introduce the following macro:

unit_test(table_name, input, expectations)

The macro receives:

  • A table name (or a view name).
  • An input value that contains a set of inserts.
  • A table of expectations.

To illustrate the usage of the macro, here is our last test case refactored:

{% set table_name = ref('a_model') %}

{% set input %}
insert into a_table(column1) values ('value'), ('value 2');
{% endset %}

{% set expectations %}
select 'value' as column1
union all
select 'value 2' as column1
{% endset %}

{{ unit_test(table_name, input, expectations) }}

There is some boilerplate when using Jinja to declare the variables that are passed to the unit test macro. Even so, it seems a worthwhile tradeoff, because this strategy enables us to:

  • Simplify the test query boilerplate.
  • Set up input data in plain SQL and in the same file.
  • Set up expectations in plain SQL and in the same file.
  • Run each test segregated from other tests.
  • Show differences when a test fails.
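For readers who want a feel for what sits behind the call above, here is a minimal, hypothetical sketch of how such a unit_test macro could be written. It is not the macro published in the Equal Experts repo, which may differ; it assumes the model under test is a view over the inserted table (or is rebuilt after the setup runs), and it simply returns the symmetric difference between the expected and actual rows, so a failing test surfaces the differing rows.

{% macro unit_test(table_name, input, expectations) %}

{# Hypothetical sketch only. Run the setup inserts when the test executes. #}
{% if execute %}
    {% do run_query(input) %}
{% endif %}

with expected as (

    {{ expectations }}

),

actual as (

    select * from {{ table_name }}

),

missing as (

    -- rows we expected but the model did not produce
    select * from expected
    except
    select * from actual

),

unexpected as (

    -- rows the model produced but we did not expect
    select * from actual
    except
    select * from expected

)

-- dbt treats any returned row as a failure, so the failing rows are the diff
select * from missing
union all
select * from unexpected

{% endmacro %}

As with the footnoted example above, EXCEPT ignores duplicate rows, so the same caveat applies to this sketch.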

To illustrate the usage of this approach, here is a demo video:



The previous macro will be available in the repo published with the Language Agnostic Data Pipelines.

*Naively, because the use of EXCEPT between the two tables fails to detect duplicate rows. It could be fixed easily, but for illustrative purposes we preferred to keep the example as simple as possible.

Bringing software engineering practices to the data world

It is also easy to apply other standard software development practices in dbt, such as integration with a CI/CD environment. This is one of the advantages of using it over transforming data inside ETL tools that use a visual programming approach.

Wrapping up, we advocate that data-oriented projects should always use well-known software engineering best practices. We hope that this article shows how you can apply TDD using the emerging dbt data transformation tool.

Pedro Sousa paired with me on this journey. He is moving from software engineering to data engineering on our current project, and he helped with this blog post.

Contact us!

For more information on data pipelines in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of data pipelines with you, get in touch using the form below.

Based on the experience shared in evolving a client’s data architecture, we decided to share a reference implementation of data pipelines. Let’s start by recalling the definition from the Data Pipeline Playbook.

What is a Data Pipeline? 

From the EE Data Pipeline playbook:

A Data Pipeline is created for data analytics purposes and has:

  • Data sources – these can be internal or external and may be structured (e.g. the result of a database call), semi-structured (e.g. a CSV file or a Google Sheet), or unstructured (e.g. text documents or images).
  • Ingestion process – the means by which data is moved from the source into the pipeline (e.g. API call, secure file transfer).
  • Transformations – in most cases data needs to be transformed from the input format of the raw data, to the one in which it is stored. There may be several transformations in a pipeline.
  • Data Quality/Cleansing – data is checked for quality at various points in the pipeline. Data quality will typically include at least validation of data types and format, as well as conforming against master data. 
  • Enrichment – data items may be enriched by adding additional fields such as reference data.
  • Storage – data is stored at various points in the pipeline. Usually at least the landing zone and a structured store such as a data warehouse.

Functional requirements

  • Pipelines that are:
    • Easy to orchestrate
    • Support scheduling 
    • Support backfilling
    • Support testing on all the steps
    • Easy to integrate with custom APIs as sources of data
    • Easy to integrate in a CI/CD environment
  • The code can be developed in multiple languages to fit each client’s skill set when Python is not a first-class citizen.

Our strategy 

In some situations a tool like Matillion, Stitchdata or Fivetran can be the best approach, although it’s not the best choice for all of our clients’ use cases. These ETL tools work well when using the existing pre-made connectors, but when the majority of the data integrations are custom connectors, they’re certainly not the best approach. Apart from the known cost, there is also an extra cost when using these kinds of tools – the effort to make the data pipelines work in a CI/CD environment. Also, at Equal Experts, we advocate testing each step of the pipeline and, if possible, developing them using test-driven development – and this is near impossible in these cases.

That being said, for the cases when an ETL tool won’t fit our needs, we identified the need for a reference implementation that we can use for different clients. Since the skill set of each team is different, and sometimes Python is not an acquired skill, we decided not to use the well-known Python tools commonly used for data pipelines these days, such as Apache Airflow or Dagster.

So we designed a solution using Argo Workflows as the orchestrator. We wanted something that allowed us to define the data pipelines as DAGs, as Airflow does.

Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo represents workflows as DAGs (directed acyclic graphs), and each step of the workflow is a container. Since data pipelines can easily be modelled as workflows, it is a great tool to use. It also gives us the freedom to choose which programming language to use for the connectors or the transformations; the only requirement is that each step of the pipeline is containerised.
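To give a flavour of what this looks like, here is a minimal, hypothetical Argo Workflow with a two-step DAG: an ingestion step followed by a dbt transformation step. The image names and commands are illustrative assumptions, not the ones used in the reference implementation.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: covid-pipeline-            # hypothetical pipeline name
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: ingest
            template: ingest
          - name: transform
            template: transform
            dependencies: [ingest]          # run only after ingestion succeeds
    - name: ingest
      container:
        image: example.org/covid-ingest:latest   # hypothetical image
        command: ["./ingest"]                    # any language works; it's just a container
    - name: transform
      container:
        image: example.org/dbt-runner:latest     # hypothetical image
        command: [dbt, run]

Submitting a manifest like this with argo submit runs the two containers in order; a real pipeline would have more steps but follows the same shape.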

For the data transformations, we found that dbt was our best choice. Dbt allows the transformations needed between the staging tables and the analytics tables. Dbt is SQL centric, so there isn’t a need to learn another language. Also, dbt has features that we wanted like testing and documentation generation and has native connections to Snowflake, BigQuery, Redshift and Postgres data warehouses. 

With these two tools, we ended up with a language-agnostic data pipeline architecture that can easily be reused and adapted for multiple use cases and different clients.

Reference implementation

Because we value knowledge sharing, we have created a public reference implementation of this architecture in a GitHub repo, which shows a pipeline for a simple use case: ingesting UK COVID-19 data (https://api.coronavirus.data.gov.uk) as an example.

The goal of the project is to have a simple implementation that can be used as an accelerator to other teams. It can be easily adapted to make other data pipelines, to integrate in a CI/CD environment, or to extend the approach and make it work for different scenarios. 

The sample project uses a local Kubernetes cluster to deploy Argo and the containers that make up the data pipeline, along with a database where the COVID-19 data is loaded and transformed, and an instance of Metabase to show the data in a friendly dashboard.

We’re planning to add infrastructure as code to the reference implementation so the project can be deployed on AWS and GCP. We might also work on aspects like making it easier to monitor the data pipelines when deployed in the cloud, or using Great Expectations.

Transparency is at the heart of our values

We value knowledge sharing and collaboration, so we hope that this article, along with the data pipelines playbook will help you to start creating data pipelines in whichever language you choose. 

Contact us!

For more information on data pipelines in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of data pipelines with you, get in touch using the form below.

 

In the mid-2010s there was a step change in the rate at which businesses started to focus on gaining valuable insights from data.

As the years have passed, the importance of data management has started to sink in throughout the industry. Organisations have realised that you can build the best models, but if your data isn’t of good quality, your results will be wrong.

There are many, varied job roles within the data space, and I always thought the distinction between the roles was pretty obvious. However, recently a lot has been written about the various data roles, and more specifically about the difference between Data Scientists and Data Engineers.

Not knowing these differences can be instrumental in teams failing or underperforming with data, which is why I am writing this article: to clarify the roles, what they mean, and how they fit together. I hope that this will help you to understand the differences between a Data Scientist and a Data Engineer within your organisation.

What do the Data Engineer and Data Scientist roles involve?

So let’s start with the basics. Data Engineers make data available to the business, and Data Scientists enable decisions to be made with the data. 

Data Engineers, at a senior level, design and implement services that enable the business to gain access to its data. They do this by building systems that automagically ingest, transform and publish data, whilst gathering relevant metadata (lineage, quality, category, etc.), enabling the right data to be utilised.  

Data Scientists not only utilise the data made available, but also uncover additional data that can be combined and processed to solve business problems.  

Both Data Scientists and Data Engineers apply similar approaches to their work. They identify a problem, they look for the best solution, then they implement the solution. The key difference is the problems they look at and, depending on their experience, the approach taken to solving them.

Data Engineers, like Software Engineers (or engineers more generally), tend to use a process of initial development, refinement and automation.

Initial development, refinement and automation explained, with cars.

In 1908 Henry Ford released the Model T Ford. It already had many of the same features as a modern car – wheels on each corner, a bonnet, a roof, seats, a steering wheel, brakes, gears.

 

In 1959 the first Mini was released.  It had all the same features as the Model T Ford. However, it was more comfortable, cheaper, easier to drive, easier to maintain, and more powerful. It also incorporated new features like windscreen wipers, a radio, indicators, rear view mirrors. Basically, the car had, over 50 years, been incrementally improved.  

Step forward another 50-odd years, and Tesla released the Models S and X. These too have many features we can see in the Model T Ford and the Mini. But now they also contain some monumental changes.

The internal combustion engine is replaced with electric drive. It has sat-nav, autopilot, and even infotainment. All of which combine to make the car much easier and more pleasurable to drive.

What we are seeing is the evolution of the car from the initial production line – basic but functional – through multiple improvements in technology, safety, economy, driver and passenger comforts. All of which improve the driving experience.  

In other words we are seeing initial development, refinement and automation. A process that Data Engineers and Data Scientists know only too well.

For Data Engineers the focus is on data: getting it from source systems to targets, ensuring the data quality is verified, the lineage captured, the attributes tagged, and access controlled.

What about Data Scientists?  They absolutely follow the same pattern, but they additionally look to develop analytics along the Descriptive, Diagnostic, Predictive, Prescriptive scale.  

So why is there confusion between the Data Scientist and Data Engineer roles?  

There is of course not a single answer but some of the common reasons include:

  • At the start, both Data Scientist and Data Engineers spend a lot of time Data Wrangling. This means trying to get the data into a shape where it can be used to deliver business benefits.
  • At first, the teams are often small and they always work very closely together, in fact, in very small organisations they may be the same person – so it’s easy to see where the confusion might come from.
  • Data Engineers are often asked to “productionise” analytics models created by Data Scientists.
  • Many Data Engineers and Data Scientists dabble in each other’s areas, as there are many skills both roles need to employ. These can include data wrangling, automation and algorithms.

As the seniority of data roles develops, so do the differences.

When I talk to and work with Data Engineers and Data Scientists, I can often group them into one of three categories – Junior, Seasoned, Principal – and when I work with Principals, in either space, you can tell they are a world apart in their respective fields.

So what differentiates the levels and roles?

That’s it. I hope this article helps you to more easily understand the differences between a Data Scientist and a Data Engineer. I also hope this helps you to more easily identify both within your organisation.  If you’d like to learn more about our Data Practice at Equal Experts, please get in touch using the form below.

 

Facing an ever-growing set of new tools and technologies, high functioning analytics teams have come to rely increasingly on data engineers. Building and managing production data engineering pipelines is an inherently complex process, which can prove hard to scale without a systematic approach.

To help navigate this complexity, we have compiled our top advice for successful solutions. Here we examine some of the key guiding principles that help data engineers (of all experience levels) effectively build and manage data pipelines. These principles are drawn from the experience of the data engineers at Equal Experts, who collectively recommend adopting them as a way to lay the foundation for sustainable and enduring pipelines.

About this series

This is part three in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. Now we consider the “must have” key principles of data pipeline projects. Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part four, we look at the six key practices needed for a data pipeline. In part five we investigate more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project. 

The growing need for good data engineering

If I have learned anything from my years working as a data engineer, it is that practically every data pipeline fails at some point. Broken connections, broken dependencies, data arriving too late, or unreachable external systems or APIs. There are many reasons. But, regardless of the cause, we can do a lot to mitigate the impact of a data pipeline’s failure. These ‘must have’ principles have been built up over the years to help ensure that projects are successful. They are based on my knowledge, and the Equal Experts team’s collective experience, gained across many data pipeline engagements.

Data pipelines are products

Pipelines bring data from important business sources. In many cases, they feed reports and analyses that endure for a long time. Unless your business never changes how it operates and never amends its low-level processes, your data pipelines will always need to adapt to changes in the fundamental processes, new IT systems, or the data itself. As something that should respond to and embrace regular change, pipelines should be treated as products rather than projects.

This means that there should be multi-year funding to monitor and maintain the existing pipelines, providing headroom to add new ones, and supporting the analysis and retirement of old ones. Pipelines need product managers to understand the pipelines’ current status and operability, and to prioritise the work. (See this Forbes article for a wider description of working in product-mode over project-mode.)

Find ways to make common use of the data

The data collected for a given problem or piece of analysis will nearly always be useful in answering other questions. When creating pipelines, we try to architect them in a way that allows reuse, whilst also remaining lean in our implementation choices.

In many cases there are simple ways of achieving this. For example, there are usually a variety of places where data is stored in the pipeline. Raw ingested data might be useful for unanticipated purposes. And it can often be made available to skilled users by providing them access to the landing zone.

Appropriate identity and access technologies, such as role-based access, can support reuse while permitting strict adherence to data-protection policies and regulations. The fundamental architecture can stay the same, with access being provided by adding or amending access roles and permissions to data buckets, databases or data warehouses.

A pipeline should operate as a well-defined unit of work

Pipelines have a cadence driven by the need for decision-making and limited by the availability of source data. The developers and users of a pipeline should understand and recognise this as a well-defined unit of work – whether every few seconds, hourly, daily, monthly or event-driven.

Pipelines should be built around use cases

In general, we recommend building pipelines around the use case rather than the data source. This will help ensure that business value is achieved early. In some cases, the same data source might be important to several use cases, each with different cadences and access rights. Understanding when to reuse parts of pipelines and when to create new ones is an important consideration. For example, faster pipelines can always be used for slower cadences, but it typically requires more effort to maintain and adapt them. It might be simpler to create a simple batch pipeline to meet a new, slower-cadence use case that is not expected to change substantially, than to focus on adapting an existing fast-streaming pipe to meet the new requirements.

Continuously deliver your pipelines

We want to be able to amend our data pipelines in an agile fashion as the data environment and needs of the business change. So, just like any other piece of working software, continuous delivery practices should be adopted to enable continuous updates of data pipelines in production. Adopting this mindset and these practices is essential to support continuous improvement and create feedback loops that rapidly expose problems and address user feedback.

Consider how you name and partition your data

Data pipelines are a mix of code and infrastructure that can become confusing as they grow if care is not taken with the naming. Pipelines will include at least a set of databases, tables, attributes, buckets, roles, etc., and they should be named in a consistent way to facilitate understanding and maintenance of the pipelines, as well as make the data meaningful to the end-users.

In many architectures, naming will directly affect how your data is partitioned, which in turn affects the speed of the search and retrieval of data. Consider what will be the most frequent queries when specifying bucket names, table partitions, shards, and so on.
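As a small, hypothetical illustration of this point: the table and column names below are made up, and partitioning syntax differs between warehouses, but the idea is to choose names that are meaningful to end users and to partition on the column your most frequent queries filter by.

-- Consistent, descriptive names plus date partitioning, so the common
-- "last N days" queries only scan the partitions they need.
create table analytics.web_page_views (
    event_date  date,
    user_id     varchar,
    page_url    varchar,
    view_count  integer
)
partition by range (event_date);   -- exact syntax varies per database/warehouse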

Want to know more?

These guiding principles have been born out of our engineers’ experience, each with 10+ years of data engineering for end-to-end machine learning solutions. We are sure there are lots of other principles, so please do let us know of any approaches you have found effective in managing data pipelines.

In our next blog post in this series we will start laying out some of the key practices of data pipelines.  Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  

Contact us!

If you’d like us to share our experience of data pipelines with you, get in touch using the form below.

 

When Equal Experts started a project for a mass media and entertainment conglomerate in the US, its initial goal was clear.

To collect digital assets (videos) and metadata from different external providers, aggregate them in a single repository, and map them to an industry-standard unique identifier. This identifier is required for the assets to be available on Video On Demand platforms.

But as you can probably gather from the title of this article, as the project evolved, so did the solution. It became necessary to collect more data related to the assets’ supply chain, and the work became more focused on data. With this shift towards data, the EE Data Studio performed a Data Health Check with the rest of the development team. And, to address the challenges uncovered by the health check, our customer expanded the team to include permanent data engineers.

What is a Data Health Check – and why is it necessary?

Performed over the course of two weeks, a Data Health Check is an assessment using quality metrics focussed specifically around data. The resulting health check provides a list of recommendations and improvements which a client could use to evolve their data platform.

Identifying immediate areas of improvement

As a result of the initial health check, we uncovered business rules spread and replicated across the views and the reporting tool – meaning a low level of reliability and confidence in the analytics data. The findings included the following:

  • Some data was being collected manually, with CSVs imported directly into the production database
  • The analytics data was being generated by several materialised views which included untested business rules
  • The views were only being refreshed once per day, so the lag between production and analytics was 24 hours
  • The data analysis and reporting was spread across multiple reporting tools: Metabase, Tableau, MS Excel

The initial step was to automate all manual data ingestion processes. Shortly after, we replaced the materialised views with tables that are updated incrementally with just the new ingested data. 

This improvement alone reduced the data lag – from production/ingestions to analytics reports – from 24h to 2h.
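As a rough sketch of the incremental pattern we moved to (the table and column names here are hypothetical), instead of fully refreshing a materialised view on a schedule, each run appends only the rows ingested since the previous run:

-- Append only the newly ingested rows since the last refresh.
insert into analytics.asset_report (asset_id, title, provider, ingested_at)
select a.asset_id, a.title, a.provider, a.ingested_at
from production.assets a
where a.ingested_at > (
    select coalesce(max(r.ingested_at), timestamp '1970-01-01')
    from analytics.asset_report r
);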

Next, to increase the analytics uniformity, we recommended a single reporting tool, removing the need to have manual extractions of data between reporting tools. We chose Metabase because it provided multiple features which fitted the client’s needs:

  • Report creation using native SQL queries
  • Ease in creating multiple dashboards with multiple reports
  • Slack and email notifications
  • The ability to embed dashboards on external sites
  • Good user management

Working with Metabase, we also identified that it could be used as an anomaly detection tool. By creating several reports, and using Slack and email notifications, we were able to implement anomaly detection for the most important business metrics.

Looking for more business benefits through the effective use of data

The result of these changes meant we were able to address the major challenges and increase the reliability of analytics. However, there was still room for improvement. 

Roughly a year later we proactively undertook a Data Health Check Revisit, in which senior management were given full sight of the progress made and of further areas for improvement. It was important to understand the business value the next level of changes could represent, and the client understood and recognised the need for these new recommendations.

How a data pipeline tool helped to realise improvements

Up until this point, all ingestions and the associated scheduling were set up inside a single Clojure project, with the management being done via configuration files within that project. This limited the flexibility to change a schedule or trigger an ingestion, because any change meant a new deployment was needed.

A key focus of the next level of improvements was to address that limitation, so that we could easily execute the following tasks:

  • Schedule data ingestions
  • Manually trigger data ingestions
  • Perform data backfills for a specific ingestion
  • Easily see an overview of the status of all running ingestions

A data pipeline tool or orchestrator was the most suitable to realise these improvements. From the currently available solutions, we devised a shortlist for comparison:

Although it is not a tool specifically for data pipelines, we chose Argo Workflows. The standout reason was that all the ingestion code is written in Clojure; using any of the other tools would have required us to migrate the code to Python. Argo, on the other hand, is a Kubernetes container orchestrator, so it is agnostic to the code that is running – it just runs containers.

Argo has a visual UI that allows users to see which workflows are running and their past runs:

It also allows us to see the details of each run, including the container logs:

The flexibility to manage the data pipelines provided by Argo Workflows has been much appreciated by our client, especially as we work across different time zones.

Using dbt to handle business logic within data transformations

Also uncovered by the Data Health Check Revisit was the need for a better strategy for handling the business logic within analytics data transformations. To achieve this, we settled on a tool called dbt.

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.

The dbt model abstraction allowed us to rewrite the business logic that was previously spread over multiple places. The dbt tool also allowed us to add documentation for each model/field and to generate a documentation website with a pleasant UI, which provides information about each table/field as well as the relations between the models (lineage graph).

 

We further increased the value of this tool to the project by writing tests for the data models. Currently we deploy the data models using a deployment pipeline, as we do for the other software components, running the tests as a pipeline step.

 

The importance of a Data Studio Health Check for continuous improvement

Tools and tech aside, we feel it’s important to note that this approach – starting with a Data Studio Health Check, followed up with continuous improvement and a proactive mindset – has enabled us to evolve our client’s data solution to provide much greater business benefit than ever before.

This new approach to ELT, using Argo Workflows and dbt, has allowed us to:

  • Centralize the analytics business logic
  • Enrich the data transformations with tests
  • Create documentation for data
  • Reduce the data lag from 2h to 10 minutes

Next, we’ll be releasing a new blog post containing a GitHub repository with an example of the data pipeline architecture described here.

Thank you Tiago Agostinho for pairing with me to write this article.  If you feel that your organisation could benefit from a Data Health Check, or would like more details on how dbt or Argo Workflows work, please get in touch using the form below.

 

The six main benefits of an effective data pipeline

When you think of the technology tools that power a successful business, a data pipeline isn’t always at the top of the list. Because, although most forward thinking companies now realise data is one of their most valuable assets, the importance of data engineering is often underestimated. 

Yet modern data pipelines enable your business to quickly and efficiently unlock the data within your organisation. They allow you to extract information from its source, transform it into a usable form, and load it into your systems where you can use it to make insightful decisions. Do it well and you will benefit from faster innovation, higher quality (with improved reliability), reduced costs, and happy people. Do it badly, and you could lose a great deal of money, miss vital information or gain completely incorrect information.

In this article we look at how a successful data pipeline can help your organisation, as we attempt to unpack and understand the benefits of data pipelines.

About this series

This is part two in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Now we look at the six main benefits of an effective data pipeline. Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part three we consider the ‘must have’ key principles of data pipeline projects, parts four and five cover the essential practices of a data pipeline. Finally, in part six we look at the many pitfalls you can encounter in a data pipeline project. 

The benefits of a great data pipeline

Simply speaking, a data pipeline is a series of steps that move raw data from a source to a destination. In the context of business intelligence, a source could be a transactional database. The destination is where the data is analysed for business insights. In this journey from the source to the destination, transformation logic is applied to data to make it ready for analysis. There are many benefits to this process, here are our top six.

1 – Replicable patterns
Understanding data processing as a network of pipelines creates a way of thinking that sees individual pipes as examples of patterns in a wider architecture, which can be reused and repurposed for new data flows.

2 – Faster timeline for integrating new data sources
Having a shared understanding and tools for how data should flow through analytics systems makes it easier to plan for the ingestion of new data sources, and reduces the time and cost for their integration.

3 – Confidence in data quality
Thinking of your data flows as pipelines that need to be monitored and be meaningful to end users improves the quality of the data and reduces the likelihood of breaks in the pipeline going undetected.

4 – Confidence in the security of the pipeline
Security is built in from the first pipeline by having repeatable patterns and a shared understanding of tools and architectures. Good security practices can be readily reused for new dataflows or data sources.

5 – Incremental build
Thinking about your dataflows as pipelines enables you to grow your dataflows incrementally. By starting with a small manageable slice from a data source to a user, you can start early and gain value quickly.

6 – Flexibility and agility
Pipelines provide a framework where you can respond flexibly to changes in the sources or your data users’ needs.

Designing extensible, modular, reusable Data Pipelines is a larger topic and very relevant in Data Engineering. In the next blog post in this series, we will outline the principles of data pipelines. Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.

Contact us!
If you’d like us to share our experience of data pipelines with you, get in touch using the form below.

 

It is common to hear that ‘data is the new oil,’ and whether you agree or not, there is certainly a lot of untapped value in much of the data that organisations hold.

Data is like oil in another way – it flows through pipelines. A data pipeline ensures the efficient flow of data from one location to the other. A good pipeline allows your organisation to integrate new data sources faster, provide patterns that you can replicate, gives you confidence in your data quality, and builds in security. But, data flow can be precarious and, when not given the correct attention, it can quickly overwhelm your organisation. Data can leak, become corrupted, and hit bottlenecks and, as the complexity of the requirements grow, and the number of data sources multiplies, these problems increase in scale and impact.

About this series

This is part one in our six part series on the data pipeline, taken from our latest playbook. Here we look at the very basics – what is a data pipeline and who is it used by? Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part two, we look at the six main benefits of a good data pipeline, part three considers the ‘must have’ key principles of data pipeline projects, and parts four and five cover the essential practices of a data pipeline. Finally, in part six we look at the many pitfalls you can encounter in a data pipeline project. 

Why is a data pipeline critical to your organisation?

There is a lot of untapped value in the data that your organisation holds. Data that is critical if you take data analysis seriously. Put to good use, data can identify valuable business insights on your customers and your operations. However, to find these insights, the data has to be regularly, or even continuously, transported from the place where it is generated to a place where it can be analysed.

A data pipeline consolidates data from all your disparate sources into one (or multiple) destinations, to enable quick data analysis. It also ensures consistent data quality, which is absolutely crucial for reliable business insights.

So what is a data pipeline?

A data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. We like to think of this transportation as a pipeline because data goes in at one end and comes out at another location (or several others). The volume and speed of the data are limited by the type of pipe you are using and pipes can leak – meaning you can lose data if you don’t take care of them.

The data engineers who create a pipeline are a critical service for any organisation. They create the architectures that allow the data to flow to the data scientists and business intelligence teams, who generate the insight that leads to business value.

A data pipeline is created for data analytics purposes and has:

Data sources – These can be internal or external and may be structured (e.g., the result of a database call), semi-structured (e.g., a CSV file or a Google Sheets file), or unstructured (e.g., text documents or images).

Ingestion process – This is the means by which data is moved from the source into the pipeline (e.g., API call, secure file transfer).

Transformations – In most cases, data needs to be transformed from the input format of the raw data, to the one in which it is stored. There may be several transformations in a pipeline.

Data quality/cleansing – Data is checked for quality at various points in the pipeline. Data quality will typically include at least validation of data types and format, as well as conforming with the master data.

Enrichment – Data items may be enriched by adding additional fields, such as reference data.

Storage – Data is stored at various points in the pipeline, usually at least the landing zone and a structured store (such as a data warehouse).

End users – more information on this is in the next section.

So, who uses a data pipeline?

We believe that, as in any software development project, a pipeline will only be successful if you understand the needs of the users. 

Not everyone uses data in the same way. For a data pipeline, the users are typically:

Business intelligence/management information analysts, who need data to create reports; 

Data scientists who need data to do an in-depth analysis of point problems or create algorithms for key business processes (we use ‘data scientist’ in the broadest sense, including credit risk analysts, website analytics experts, etc.)

Process owners, who need to monitor how their processes are performing and troubleshoot when there are problems.

Data users are skilled at visualising and telling stories with data, identifying patterns, or understanding significance in data. Often they have strong statistical or mathematical backgrounds. And, in most cases, they are accustomed to having data provided in a structured form – ideally denormalised – so that it is easy to understand the meaning of an individual row of data without the need to query separate tables or databases.
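To illustrate what ‘structured and denormalised’ means in practice, here is a hypothetical example (the table and column names are made up): a curated view where each row carries everything a data user needs, so no further joins are required to interpret it.

-- Each row is self-describing: order facts plus the customer and product
-- attributes an analyst would otherwise have to join in themselves.
create view curated.orders_denormalised as
select
    o.order_id,
    o.ordered_at,
    c.customer_name,
    c.customer_segment,
    p.product_name,
    p.product_category,
    o.quantity,
    o.quantity * o.unit_price as order_value
from orders o
join customers c on c.customer_id = o.customer_id
join products  p on p.product_id  = o.product_id;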

Is a data pipeline a platform?

Every organisation would benefit from a place where they can collect and analyse data from different parts of the business. Historically, this has often been met by a data platform, a centralised data store where useful data is collected and made available to approved people. 

But, whether they like it or not, most organisations are, in fact, a dynamic mesh of data connections which need to be continually maintained and updated. Following a single platform pattern often leads to a central data engineering team tasked with implementing data flows. 

The complexities of meeting everyone’s needs and ensuring appropriate information governance, as well as a lack of self-service, often make it hard to ingest new data sources. This can then lead to backlog buildup, frustrated data users, and frustrated data engineers. 

Thinking of these dataflows as a pipeline changes the mindset away from monolithic solutions, to a more decentralised way of thinking – understanding what pipes and data stores you need and implementing them the right way for that case whilst reusing where appropriate.

So now that we have understood a little more about the data pipeline, what it is and how it works, we can start to understand the benefits and assess whether they align with your digital strategy. We cover these in the next blog article, ‘What are the benefits of data pipelines?’

For more information on the data pipeline in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of the data pipeline with you, get in touch using the form below.

 

 

If you’re a senior IT leader,  I’d like to make a prediction. You have faced a key data governance challenge at some time. Probably quite recently. In fact, there is a good chance that you’re facing one right now. I know this to be true, because clients approach us frequently with this exact issue. 

However, it’s not a single issue. In fact, over time we have come to realise that data governance is a slippery term that means different things to different people. Which is why we felt that a deeper investigation into the subject was needed, to gain clarity and understanding around this overloaded term and to establish how we can talk to clients who see data governance as a challenge.

So, what is data governance? And what motivates an organisation to be interested in it?

Through a series of surveys, discussions and our own experiences, we have come to the conclusion that client interest in data governance is motivated by the following wide range of reasons.

1. Data Security/Privacy

I want to be confident that the right measures are in place to secure my data assets and that we have the right protections in place.

2. Compliance – To meet industry requirements

I have specific regulations to meet (e.g. health, insurance, finance) such as:

  • Storage – I need to store specific data items for specified periods of time (or I can only store for specific periods of time).
  • Audit – I need to provide access to specified data for audit purposes.
  • Data lineage/traceability – I have to be able to show where my data came from or why a decision was reached.
  • Non-repudiation – I have to be able to demonstrate that the data has not been tampered with.

3. Data quality

My data is often of poor quality: it is missing data points, the values are often wrong or out of date, and now no-one trusts it. This is often seen in the context of central data teams charged with providing data to business functions such as operations, marketing etc. Sometimes data stewardship is mentioned as a means of addressing this.

4. Master/Reference Data Management

When I look at data about the same entities in different systems I get different answers.

5. Preparing my data for AI and automation

I am using machine learning and/or AI and I need to know why decisions are being made (as regulations around the use of AI and ML mature this is becoming more pressing – see for example https://ico.org.uk/for-organisations/guide-to-data-protection/key-data-protection-themes/explaining-decisions-made-with-ai/).

6. Data Access/Discovery

I want to make it easier for people to find data or re-use data – it’s difficult for our people to find and/or access data which would improve our business. I want to overcome my data silos. I want data consumers to be able to query data catalogues to find what they need.

7. Data Management

I want to know what data we have e.g. by compiling data dictionaries. I want more consistency about how we name data items. I want to employ schema management and versioning.

8. Data Strategy

I want to know what strategy I should take so my organisation can make better decisions using data. And how do I quantify the benefits?

9. Creating a data-driven organisation

I want to create an operating model so that my business can manage and gain value from its data.

I think it’s clear from this that there are many concerns covered by the term data governance. You probably recognise one, or maybe even several, as your own. So what do you need to do to overcome these? Well, now we understand the variety of concerns, we can start to address the approach to a solution. 

Understanding Lean Data Governance

Whilst it can be tempting for clients to look for an off-the-shelf solution to meet their needs, in reality these needs are too varied to be met by a single product. Especially as many of the concerns are integral to the data architecture. Take data lineage and quality as examples that need to be considered as you implement your data pipelines – you can’t easily bolt them on as an afterthought.

Here at Equal Experts, we advocate taking a lean approach to data governance – identify what you are trying to achieve and implement the measures needed to meet those goals.

The truth is, a large proportion of the concerns raised above can be met by following good practices when constructing and operating data architectures – the sorts of practices that are outlined in our Data Pipeline and Secure Delivery playbooks.  

We have found that good data governance emerges by applying these practices as part of delivery. For example:

  • Most Data security concerns can be met by proven approaches – taking care during environment provisioning, implementing role-based access control, implementing access monitoring and alerts and following the principles that security is continuous and collaborative.
  • Many Data Quality issues can be addressed by implementing the right measures in your data pipelines – incorporating observability through the pipelines – enabling you to detect when changes happen in data flows; and/or pragmatically applying master and reference data so that there is consistency in data outputs. 
  • Challenges with data access and overcoming data silos are improved by constructing data pipelines with an architecture that supports wider access. For example our reference architecture includes data warehouses for storing curated data as well as landing zones which can be opened up to enable self-service for power data users. Many data warehouses include data cataloguing or data discovery tools to improve sharing.
  • Compliance challenges are often primarily about data access and security (which we have just addressed above) or data retention, which depends on your pipelines.

Of course, it is important that implementing these practices is given sufficient priority during the delivery. And it is critical that product owners and delivery leads ensure that they remain in focus. The tasks that lead to good Data Governance can get lost when faced with excessive demands for additional user features. In our experience this is a mistake, as deprioritising governance activities will lead to drops in data quality, resulting in a loss of trust in the data and in the end will significantly affect the user experience.

Is Data Governance the same as Information Governance?

Sometimes we also hear the term Information Governance. Information Governance usually refers to the legal framework around data. It defines what data needs to be protected and any processes (e.g. data audits), compliance activities or organisational structures that need to be in place. GDPR is an Information Governance requirement – it specifies what everyone’s legal obligations are in respect of the data they hold, but it does not specify how to meet those obligations. Equal Experts does not create information governance policies, although we work with client information governance teams to design and implement the means to meet them.

The field of data governance is inherently complex, but I hope through this article you’ve been able to glean insights and understand some of the core tenets driving our approach. 

These insights and much more are in our Data Pipeline and Secure Delivery playbooks. And, of course, we are keen to hear what you think Data Governance means. So please feel free to get in touch with your questions, comments or additions on the form below.