What does good data governance look like – and how do we build it?

As we emerge from the pandemic, for many businesses the biggest concern isn’t being too bold – it’s being too cautious.

Business leaders are looking to accelerate transformation and deliver ambitious new services that are invariably powered by technology. IT leaders are in the hot seat, and that’s a worry if you’re not 100% confident in your data.

Can you guarantee that data quality meets requirements? Do you have the systems and skills to integrate data from multiple platforms, silos and applications? Can you track where data comes from, and how it is processed at each stage of the journey?

If not, you’ve got a data governance problem.  

Without strong, high-quality governance, organisations are at the mercy of inaccurate, insufficient and out-of-date information. That puts you at risk of making poor decisions that lead to lost business opportunities, reputational damage and reduced profits – and that’s just for starters.

What does high-quality data governance look like?

It’s likely that the IT department will own data governance, but the strategy must be mapped to wider business goals and priorities.

As a rough guideline, here are nine key things that we think must be part of an effective data governance strategy:

  1.     Data security/privacy: do we have the right measures in place to secure data assets?
  2.     Compliance: are we meeting industry and statutory requirements in areas such as storage, audit, data lineage and non-repudiation?
  3.     Data quality: do we have a system in place to identify data that is poor quality, such as missing data points, incorrect values or out-of-date information? Is such information corrected efficiently, to maintain trust in our data? 
  4.     Master/Reference data management: If I look at data in different systems, do I see different answers?
  5.     Readiness for AI/automation: if we are using machine learning or AI, do we know why decisions are being made (in line with regulations around AI/ML)?
  6.     Data access/discovery: Are we making it easier for people to find and reuse data? Can data consumers query data catalogues to find information, or do we need to find ways to make this easier?
  7.     Data management: Do we have a clear overview of the data assets we have? This might require the creation of data dictionaries and schema that allow for consistent naming of data items and versioning.
  8.     Data strategy: What business and transformation strategy does our data support? How does this impact the sort of decisions we make?
  9.     Creating a data-driven organisation: do we need to create an operating model so the business can manage – and gain value from – this data?

Moving from data policy to data governance

As we can see, data governance is about more than simply having an IT policy that covers the collection, storage and retention of data. Effective, high-level data governance needs to ensure that data is supporting the broader business strategy and can be accessed and relied upon to support timely and accurate decision-making.

So how do IT leaders start to move from the first view of governance to the second?

While it can be tempting for organisations to buy an off-the-shelf solution for data governance, it’s unlikely to meet your needs, and may not align with your strategic goals.

Understanding your strategy first means the business can partner with IT to identify the architecture changes that might be needed, and then identify solutions that will meet these needs.

Understanding Lean Data Governance

Here at Equal Experts, we advocate taking a lean approach to data governance – identify what you are trying to achieve, and implement only the measures needed to meet those goals.

The truth is that a large proportion of the concerns raised above can be met by following good practices when constructing and operating data architectures. You’ll find more information about best practices in our Data Pipeline and Secure Delivery playbooks.

The quality of data governance can be improved by applying these practices. For example:

  • It’s possible to address data security concerns using proven approaches such as careful environment provisioning, role-based access control and access monitoring.
  • Many data quality issues can be resolved by implementing the right measures in your data pipelines, such as incorporating observability so that you can detect when changes happen in data flows, and pragmatically applying master and reference data so that data outputs are consistent (see the sketch after this list).
  • To improve data access and overcome data silos, organisations should construct data pipelines with an architecture that supports wider access.
  • Compliance issues are often related to data access and security, or data retention. Good implementation in these areas makes achieving compliance much more straightforward.
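
To make the observability point concrete, here is a minimal sketch of the kind of check a pipeline step could run on each batch of data. It is illustrative only: the function, thresholds and column handling are hypothetical rather than taken from our playbooks.

```python
import pandas as pd


def check_data_quality(df: pd.DataFrame, expected_min_rows: int, max_null_rate: float = 0.05) -> list[str]:
    """Return a list of data quality issues found in a pipeline batch."""
    issues = []

    # Observability: detect unexpected drops in volume between pipeline runs.
    if len(df) < expected_min_rows:
        issues.append(f"Row count {len(df)} is below the expected minimum {expected_min_rows}")

    # Flag columns whose null rate exceeds the agreed threshold.
    for column in df.columns:
        null_rate = df[column].isna().mean()
        if null_rate > max_null_rate:
            issues.append(f"Column '{column}' has a null rate of {null_rate:.1%}")

    return issues


if __name__ == "__main__":
    # Hypothetical batch used only to show how the check reports problems.
    batch = pd.DataFrame({"customer_id": [1, 2, None], "amount": [10.0, None, 7.5]})
    for issue in check_data_quality(batch, expected_min_rows=2, max_null_rate=0.2):
        print("WARNING:", issue)
```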

The field of data governance is inherently complex, but I hope through this article you’ve been able to glean insights and understand some of the core tenets driving our approach.

These insights and much more are in our Data Pipeline and Secure Delivery playbooks. And, of course, we are keen to hear what you think Data Governance means to your organisation. So please feel free to get in touch with your questions, comments or additions on the form below.

Based on our experience of evolving a client’s data architecture, we decided to share a reference implementation of data pipelines. Let’s start with a quick recap from the Data Pipeline playbook.

What is a Data Pipeline? 

From the EE Data Pipeline playbook:

A Data Pipeline is created for data analytics purposes and has:

  • Data sources – these can be internal or external and may be structured (e.g. the result of a database call), semi-structured (e.g. a CSV file or a Google Sheet), or unstructured (e.g. text documents or images).
  • Ingestion process – the means by which data is moved from the source into the pipeline (e.g. API call, secure file transfer).
  • Transformations – in most cases data needs to be transformed from the input format of the raw data, to the one in which it is stored. There may be several transformations in a pipeline.
  • Data Quality/Cleansing – data is checked for quality at various points in the pipeline. Data quality will typically include at least validation of data types and format, as well as conforming against master data. 
  • Enrichment – data items may be enriched by adding additional fields such as reference data.
  • Storage – data is stored at various points in the pipeline, usually at least in a landing zone and in a structured store such as a data warehouse.
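
To make these stages concrete, here is a minimal, hypothetical sketch of how the components above might fit together as plain functions. The source format, field names and storage target are illustrative assumptions, not part of the playbook.

```python
import csv
import io
import json
import urllib.request


def ingest(source_url: str) -> list[dict]:
    """Ingestion: pull raw CSV data from a (hypothetical) source into the pipeline."""
    with urllib.request.urlopen(source_url) as response:
        raw = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list[dict]) -> list[dict]:
    """Transformation: reshape the raw input into the format we want to store."""
    return [{"id": r["id"], "value": float(r["value"])} for r in rows]


def cleanse(rows: list[dict]) -> list[dict]:
    """Data quality/cleansing: drop records that fail basic validation."""
    return [r for r in rows if r["value"] >= 0]


def enrich(rows: list[dict], reference: dict[str, str]) -> list[dict]:
    """Enrichment: add reference data fields to each record."""
    return [{**r, "category": reference.get(r["id"], "unknown")} for r in rows]


def store(rows: list[dict], path: str) -> None:
    """Storage: persist the curated output (a file standing in for a warehouse table)."""
    with open(path, "w") as f:
        json.dump(rows, f)
```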

Functional requirements

  • Pipelines that:
    • are easy to orchestrate
    • support scheduling
    • support backfilling
    • support testing of every step (a test sketch follows below)
    • are easy to integrate with custom APIs as sources of data
    • are easy to integrate into a CI/CD environment
  • Code that can be developed in multiple languages, to fit each client’s skill set when Python is not a first-class citizen.
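
As a sketch of what testing every step can look like in practice, each step can be a plain function with unit tests that run in CI before the step’s container is built. The transform function below is the hypothetical one from the earlier sketch, inlined so the test file stands alone.

```python
# test_transform.py -- run with `pytest` as part of the CI/CD build.
# The transform step here is the hypothetical one from the sketch above,
# inlined so this test file is self-contained.


def transform(rows: list[dict]) -> list[dict]:
    """Reshape raw string records into typed records ready for storage."""
    return [{"id": r["id"], "value": float(r["value"])} for r in rows]


def test_transform_parses_values_as_floats():
    assert transform([{"id": "a", "value": "1.5"}]) == [{"id": "a", "value": 1.5}]


def test_transform_handles_empty_input():
    assert transform([]) == []
```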

Our strategy 

In some situations a tool like Matillion, Stitchdata or Fivetran can be the best approach, although it isn’t the best choice for all of our clients’ use cases. These ETL tools work well when the existing pre-made connectors cover your sources, but when the majority of the data integrations need custom connectors they are rarely the best approach. Beyond the cost of the tool itself, there is the extra effort required to make the data pipelines work in a CI/CD environment. At Equal Experts we also advocate testing each step of the pipeline and, where possible, developing it using test-driven development – which is close to impossible with these tools.

For the cases where an ETL tool won’t fit our needs, we identified the need for a reference implementation that we can reuse across different clients. Since the skill set of each team is different, and Python is not always an established skill, we decided not to build on the well-known Python tools commonly used for data pipelines today, such as Apache Airflow or Dagster.

So we designed a solution using Argo Workflows as the orchestrator: we wanted something that allowed us to define data pipelines as DAGs, in the same way that Airflow does.

Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo represents workflows as DAGs (Directed Acyclic Graphs), and each step of a workflow is a container. Since data pipelines can easily be modelled as workflows, it is a great fit. It also gives us the freedom to choose which programming language to use for the connectors and transformations; the only requirement is that each step of the pipeline is containerised.

For the data transformations, we found that dbt was the best choice. dbt handles the transformations needed between the staging tables and the analytics tables. It is SQL-centric, so there is no need to learn another language; it has features we wanted, such as testing and documentation generation; and it has native connections to Snowflake, BigQuery, Redshift and Postgres data warehouses.
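
In this architecture, the dbt transformation is just another containerised step in the Argo DAG. A minimal sketch of a wrapper script for such a step might look like the following; the project directory is an assumption, not a path from our reference implementation.

```python
import subprocess
import sys


def run_dbt(command: str, project_dir: str = "/dbt/project") -> None:
    """Run a dbt CLI command inside the transformation container, failing the step on error."""
    # The project directory is a placeholder; point it at your own dbt project.
    result = subprocess.run(["dbt", command, "--project-dir", project_dir], check=False)
    if result.returncode != 0:
        sys.exit(result.returncode)  # let Argo mark this workflow step as failed


if __name__ == "__main__":
    run_dbt("run")   # build the staging and analytics models
    run_dbt("test")  # run the schema and data tests defined in the dbt project
```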

With these two tools we ended up with a language-agnostic data pipeline architecture that can easily be reused and adapted for multiple use cases and different clients.

Reference implementation

Because we value knowledge sharing, we have created a public reference implementation of this architecture in a GitHub repository. It shows a pipeline for a simple use case: ingesting UK COVID-19 data (https://api.coronavirus.data.gov.uk) as an example.

The goal of the project is to provide a simple implementation that can be used as an accelerator by other teams. It can easily be adapted to build other data pipelines, integrated into a CI/CD environment, or extended to work for different scenarios.

The sample project uses a local Kubernetes cluster to deploy Argo and the containers that make up the data pipeline, along with a database where the COVID-19 data is loaded and transformed, and an instance of Metabase to show the data in a friendly dashboard.
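
To give a flavour of what an ingestion step for this use case might do, here is a minimal sketch that pulls data from the UK COVID-19 API and lands it as raw JSON. The query parameters, file paths and step boundaries are illustrative assumptions and may differ from the actual repository.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.coronavirus.data.gov.uk/v1/data"


def ingest_covid_data(output_path: str) -> None:
    """Fetch case data from the UK COVID-19 API and land it as raw JSON."""
    # The filter and structure parameters below are illustrative; check the API
    # documentation and the reference repository for the exact query used there.
    params = {
        "filters": "areaType=nation;areaName=england",
        "structure": json.dumps({"date": "date", "newCases": "newCasesByPublishDate"}),
    }
    url = f"{API_URL}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)

    # Land the raw response untouched; transformation happens in a later dbt step.
    with open(output_path, "w") as f:
        json.dump(payload, f)


if __name__ == "__main__":
    ingest_covid_data("/data/landing/covid_cases.json")  # hypothetical landing path
```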

We’re planning to add infrastructure as code to the reference implementation so that the project can be deployed on AWS and GCP. We may also work on aspects such as making it easier to monitor the data pipelines when they are deployed in the cloud, or adding data validation with Great Expectations.

Transparency is at the heart of our values

We value knowledge sharing and collaboration, so we hope that this article, along with the data pipelines playbook will help you to start creating data pipelines in whichever language you choose. 

Contact us!

For more information on data pipelines in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of data pipelines with you, get in touch using the form below.

 

If you’re a senior IT leader,  I’d like to make a prediction. You have faced a key data governance challenge at some time. Probably quite recently. In fact, there is a good chance that you’re facing one right now. I know this to be true, because clients approach us frequently with this exact issue. 

However, it’s not a single issue. In fact, over time we have come to realise that data governance is a slippery term that means different things to different people, which is why we felt a deeper investigation into the subject was needed – to gain clarity and understanding around this overloaded term, and to establish how we can talk to clients who see data governance as a challenge.

So, what is data governance? And what motivates an organisation to be interested in it?

Through a series of surveys, discussions and our own experiences, we have come to the conclusion that client interest in data governance is motivated by a wide range of reasons, set out below.

1. Data Security/Privacy

I want to be confident that the right measures are in place to secure my data assets and that they are properly protected.

2. Compliance – To meet industry requirements

I have specific regulations to meet (e.g. in health, insurance or finance), covering areas such as:

  • Storage – I need to store specific data items for specified periods of time (or I can only store for specific periods of time).
  • Audit – I need to provide access to specified data for audit purposes.
  • Data lineage/traceability – I have to be able to show where my data came from or why a decision was reached.
  • Non-repudiation – I have to be able to demonstrate that the data has not been tampered with.

3. Data quality

My data is often of poor quality: it is missing data points, the values are often wrong or out of date, and now no one trusts it. This is often seen in the context of central data teams charged with providing data to business functions such as operations, marketing etc. Sometimes data stewardship is mentioned as a means of addressing this.

4. Master/Reference Data Management

When I look at data about the same entities in different systems I get different answers.

5. Preparing my data for AI and automation

I am using machine learning and/or AI and I need to know why decisions are being made (as regulations around the use of AI and ML mature this is becoming more pressing – see for example https://ico.org.uk/for-organisations/guide-to-data-protection/key-data-protection-themes/explaining-decisions-made-with-ai/).

6. Data Access/Discovery

I want to make it easier for people to find data or re-use data – it’s difficult for our people to find and/or access data which would improve our business. I want to overcome my data silos. I want data consumers to be able to query data catalogues to find what they need.

7. Data Management

I want to know what data we have e.g. by compiling data dictionaries. I want more consistency about how we name data items. I want to employ schema management and versioning.

8. Data Strategy

I want to know what strategy I should take so my organisation can make better decisions using data. And how do I quantify the benefits?

9. Creating a data-driven organisation

I want to create an operating model so that my business can manage and gain value from its data.

I think it’s clear from this that there are many concerns covered by the term data governance. You probably recognise one, or maybe even several, as your own. So what do you need to do to overcome these? Well, now we understand the variety of concerns, we can start to address the approach to a solution. 

Understanding Lean Data Governance

Whilst it can be tempting for clients to look for an off-the-shelf solution to meet their needs, in reality those needs are too varied to be met by a single product – especially as many of the concerns are integral to the data architecture. Take data lineage and quality as examples that need to be considered as you implement your data pipelines – you can’t easily bolt them on as an afterthought.

Here at Equal Experts, we advocate taking a lean approach to data governance – identify what you are trying to achieve and implement only the measures needed to meet those goals.

The truth is, a large proportion of the concerns raised above can be met by following good practices when constructing and operating data architectures – the sorts of practices that are outlined in our Data Pipeline and Secure Delivery playbooks.  

We have found that good data governance emerges by applying these practices as part of delivery. For example:

  • Most data security concerns can be met by proven approaches – taking care during environment provisioning, implementing role-based access control, implementing access monitoring and alerting, and following the principle that security is continuous and collaborative.
  • Many data quality issues can be addressed by implementing the right measures in your data pipelines – incorporating observability through the pipelines, so you can detect when changes happen in data flows, and/or pragmatically applying master and reference data so that there is consistency in data outputs.
  • Challenges with data access and overcoming data silos are improved by constructing data pipelines with an architecture that supports wider access. For example, our reference architecture includes data warehouses for storing curated data, as well as landing zones that can be opened up to enable self-service for power data users. Many data warehouses include data cataloguing or data discovery tools to improve sharing.
  • Compliance challenges are often primarily about data access and security (which we have just addressed above) or data retention, which depends on your pipelines (see the sketch after this list).
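
As an illustration of the retention point, a retention rule can be enforced as a small scheduled step in the pipeline. This is a hedged sketch only: the table, column and retention period are placeholders, not a recommendation.

```python
from datetime import datetime, timedelta, timezone
import sqlite3  # stands in for whichever warehouse or database the pipeline uses

RETENTION_DAYS = 365  # placeholder; set this from your actual compliance requirements


def apply_retention(connection: sqlite3.Connection) -> int:
    """Delete landed records older than the retention period and return how many were removed."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)).isoformat()
    # 'landing_events' and 'ingested_at' are hypothetical names for a landed dataset.
    cursor = connection.execute(
        "DELETE FROM landing_events WHERE ingested_at < ?", (cutoff,)
    )
    connection.commit()
    return cursor.rowcount
```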

Of course, it is important that implementing these practices is given sufficient priority during delivery, and it is critical that product owners and delivery leads ensure they remain in focus. The tasks that lead to good data governance can get lost when faced with excessive demands for additional user features. In our experience this is a mistake: deprioritising governance activities leads to drops in data quality, which results in a loss of trust in the data and, ultimately, a significantly worse user experience.

Is Data Governance the same as Information Governance?

Sometimes we also hear the term Information Governance. Information Governance usually refers to the legal framework around data. It defines what data needs to be protected and any processes (e.g. data audits), compliance activities or organisational structures that need to be in place. GDPR is an Information Governance requirement – it specifies what everyone’s legal obligations are in respect of the data they hold, but it does not specify how to meet those obligations. Equal Experts does not create information governance policies, although we do work with clients’ information governance teams to design and implement the means to meet them.

The field of data governance is inherently complex, but I hope through this article you’ve been able to glean insights and understand some of the core tenets driving our approach. 

These insights and much more are in our Data Pipeline and Secure Delivery playbooks. And, of course, we are keen to hear what you think Data Governance means. So please feel free to get in touch with your questions, comments or additions on the form below.