Challenges in working with low-code tools in data engineering projects

This article is to share my experience and challenges of working as a data SDET on a data engineering project using low-code tools. 

This article has been written in close collaboration with Simon Case – Head of Data at Equal Experts.

Introduction on low code tools:

A lot of data engineering work  involves extracting data from operational systems (on-premise databases, workflow systems etc.), and ingesting it into an environment – such as a data warehouse – where it can be stored and queried without negatively  impacting the operational services.

There are many ways to do this, and one of the challenges for a data engineering team is to decide which is the right one for the technical and business landscape they are working in. At Equal Experts we have a strong preference for approaches which can fit into a continuous delivery methodology  – continuous delivery has been shown to be the most efficient way of delivering software and we have found that software which is focused on data is no different. This requires  a code based solution – with everything stored in a  repo.

However, there are many tools which promise to streamline this process through the use of low-code or no-code technologies. These:

  • Use a visual development approach – solutions are specified with drag and drop graphical user interfaces.
  • Are aimed at  allowing users with little or no software development experience to create solutions.
  • Often have large  lists of connectors to connect to different external third party sources which enable the development team and data savvy users to quickly orchestrate data and do experimentation with it.

Recently we worked  with a client who had adopted a low-code tool for data, which enabled the specification of ETL/ELT data pipelines through the use of a visual programming, drag and drop interface. This blog captures some of our experiences of working with these tools. Whilst you cannot reach the same levels of automation as with a purely code-based solution, we found some ways of improving the reliability of deployments.

What are the main challenges of a low-code/no-code solution?

From the perspective of developers who follow continuous delivery practices, we found there were a number of things that prevented us from creating a development process as we would have liked:

  • Immature Git integration – the integration of low code tools with Git for version control is still in an infancy stage. Whilst many offer some ability to commit changes to a git repo, there is no easy way to do a git diff between branches or between commits within a feature branch to highlight the differences and to resolve the merge conflicts. Teams have to be careful in their branching and merging strategy to avoid the merge conflicts by allocating work to different functionality areas so that developers do not trip up on each other’s work by working on the same job.
  • No unit-testing framework – in a continuous delivery approach, continuous testing is critical. Having small tests associated with each change means developers can check that later changes to the code won’t break the solution, without the need to run end to end tests. This hugely improves feedback loops by reducing the time to detect errors and also improves the ability to deliver features. In low or no-code solutions there is no easy way to create small scale simple tests that check small pieces of functionality.
  • Manual promotion – because unit-testing is not possible, promotion to new environments (e.g. from development to UAT) required manual steps and  it was not possible to create a true automated  CI/CD environment.
  • QA becomes a bottleneck – because we were not able to test in depth with unit tests, testing got shifted right to the QA team. Lots of manual tests were required  for each release. Fortunately there were willing data users in the company, who would test releases and include edge-cases.

The approaches we took to improve the ability to deliver new  data pipelines? 

We did find that there were things from the modern software development toolbox that we could apply, and that did improve the team’s ability to deliver a reliable and trustworthy solution:

  • Infrastructure as code – for all data pipelines, the use of IaC using tools like Terraform or Cloud formation is pretty much essential. Fortunately we found that we were able to use these with the low-code solution, and the orchestration solution could be hosted in EC2.
  • Version controlling the Database – like most data pipelines, a large part of the work was handled within the data warehouse (in our case Snowflake.) We applied Liquibase to control the structures of the databases. This turned out to be a life-saver. It made it simple to clone the database across the environments and saved us a lot of work.
  • Automating QA with SodaSQL – we used SodaSQL tool to create data quality tests which reduced some of the manual testing effort. We even managed to create tests to prevent regressions. It’s not as good as a true unit test framework, but it definitely improved the system stability and reduced the time to deliver data pipelines.

I hope you found these learnings useful. These insights and much more can be found in our data pipeline playbook.

If you want to talk to us about Data Engineering or our experiences in using low-code data tools please do get in touch.



Inspired by our previous post on unit testing dbt and discussion in the dbt forums, I looked into how to get dbt to help me test and deploy UDFs. Although dbt doesn’t directly support UDFs, it turns out it’s actually fairly straightforward to define, test and reference them in your models. I’ll be using BigQuery in this post, but I think the techniques will apply to other warehouses that support UDFs like Snowflake, Redshift and Spark, too.

What’s a UDF and what does it have to do with dbt?

A UDF is a User-Defined Function. Like any other type of function, they’re really helpful to capture and name chunks of logic that you can apply in queries, and I talked about them before, starting from Tip 4 in our Data Warehouse Tips post. dbt takes care of deploying the tables and views that you might use your UDFs in, so robustly referencing a function from your models is important. When I say ‘robustly’, I mean ensuring that the function exists and that you’re calling the function from the appropriate location – for example, avoiding hard-coding your production functions everywhere. There’s no native support for referencing a UDF like there is with `source` or `ref`, nor any native way to deploy a function in a project. This post explains how you can solve those problems.

The example

I have a BigQuery dataset that I use for my personal development work. This is referenced in my own dbt profile. When I run dbt commands on my machine, they take effect here.

There are test and production datasets that my CI system has profiles for. I cannot run ad-hoc dbt commands in these datasets. When I push my changes, the CI system will run any dbt commands necessary to deploy my updates and run tests. I’ll not be talking about that in this post, but it’s important that I’m referencing the UDF that is deployed in the same environment that I’m working in – otherwise I can’t iterate safely, relying on my tests.

The example UDF I’ll use is a simple predicate function, located in `my_schema`, from string to bool that returns true if and only if the string value can be interpreted as a positive integer:

CREATE OR REPLACE FUNCTION my_schema.is_positive_int(a_string STRING)

RETURNS BOOLEAN

AS (

  REGEXP_CONTAINS(a_string, r'^[0-9]+$')

);

Defining a UDF in dbt

We need to write a dbt “macro” – a Jinja template – that parameterises where the UDF should go. In `macros/is_positive_int_udf.sql`:

{% macro is_positive_int_udf() %}

CREATE OR REPLACE FUNCTION {{ target.schema }}.is_positive_int(a_string STRING)

RETURNS BOOLEAN

AS (

  REGEXP_CONTAINS(a_string, r'^[0-9]+$')

);

{% endmacro %}

Note the macro name matches the file and that `target.schema` is templated out.

To have dbt actually deploy this for us, we add a hook in our `dbt_project.yml`:

on-run-start:

- '{{ is_positive_int_udf() }}'

Now, when we `dbt run`, the UDF will be deployed.

11:11:52  Running 1 on-run-start hooks

11:11:52  1 of 1 START hook: my_project.on-run-start.0.................... [RUN]

11:11:52  1 of 1 OK hook: my_project.on-run-start.0....................... [OK in 0.49s]

Kind of yucky in the console output with just list indices to identify what’s what –  but not the end of the world.

Testing a UDF in dbt

Now our UDF is schema-aware and is being deployed by dbt, we can write tests for it with minimal ceremony. In `tests/is_positive_int.sql`:

WITH examples AS (

  SELECT 'foo' the_string, FALSE expected

  UNION ALL SELECT '1111', TRUE

  UNION ALL SELECT '5 not this', FALSE

  UNION ALL SELECT 'still not this 5', FALSE

  UNION ALL SELECT NULL, NULL

)


, test AS (

  SELECT

 *,

  {{ target.schema }}.is_positive_int(the_string) actual

  FROM examples

)



SELECT * FROM test WHERE TO_JSON_STRING(actual) != TO_JSON_STRING(expected)

Running `dbt test` runs the pre-hooks and then the tests, perfect.

...

11:25:22  Running 1 on-run-start hooks

11:25:22  1 of 1 START hook: my_project.on-run-start.0.................... [RUN]

11:25:23  1 of 1 OK hook: my_project.on-run-start.0....................... [OK in 0.58s]

...

11:25:24  1 of 1 START test is_positive_int............................... [RUN]

11:25:25  1 of 1 PASS is_positive_int..................................... [PASS in 1.02s]



11:25:26  Finished running 1 tests, 1 hooks in 1.46s.

Using the UDF in a Model

We’ve actually already seen how to do this, in the test.

In `models/my_model.sql`:

SELECT {{ target.schema }}.is_positive_int(maybe_positive_int_column)

FROM {{ ref(...) }}

Summary

There you have it. In summary, to incorporate UDFs neatly into your dbt models:

  • Write a macro for your UDF, templating the target schema
  • Add a call to macro to `on-run-start` in your `dbt_project.yaml`
  • Write some tests, of course
  • Use the UDF in your models, templating the target schema

Whatever your role, you might come into contact with a data warehouse at some point. They’re incredibly powerful tools that I’ve frequently seen ignored or used in ways that cause problems. In this post, I’ll share a few tips for handling data in your data warehouse effectively. Whilst the focus here is helping data scientists, these tips have helped other tech professionals answer their own data questions and avoid unnecessary work, too.

Gaps in traditional data science training

Traditional data science training focuses on academic techniques, not practicalities of dealing with data in the real world. The odds are you’ll learn five different ways to perform a clustering, but you’ll not write a test or a SQL query.

Fresh data scientists, landing in their first professional roles, can find that the reality of data science is not what they expected. They must deal with messy and unreliable terabyte-scale datasets that they can’t just load into a Pandas or R dataframe to clean and refine. Their code needs to run efficiently enough to be useful and cost-effective. It needs to run in environments other than their laptop, robustly interact with remote services and be maintained over time and by other people.

This mismatch between training, expectation and reality can cause real problems*. At one end of the scale, you get data scientists producing code that needs to be “productionised” (well, rewritten) – by “data engineers” to be fit for its real-world purpose. At the scarier end of the scale you get data science time investments that have to be thrown in the bin because a couple of early mistakes mean it just doesn’t work in practice.

(* this blog post isn’t about the other big expectation mismatch – how you’ll likely be spending a good chunk of your time doing pretty basic things like writing simple reports, not the cool data science stuff… sorry )

Data science training is adapting to make more room for practical skills. Here’s a few tips to avoid making mistakes I’ve helped fix for fresh and seasoned data scientists out there in the field today.

Use your data warehouse effectively

There’s a good chance that your data science is applied to data that came from a SQL-based analytics system like Google BigQuery, AWS Redshift, Azure Analytics, Snowflake or even Spark SQL. These systems are engineered to efficiently answer complex queries against big datasets, not to serve up large quantities of raw data. You might be able to:

SELECT

  *

FROM client_transactions

 

…streaming out fifty columns by ten million rows to find the top ten clients by spend in Pandas, but it’ll be an awful lot faster and more robust if you do more work in the data warehouse and retrieve just the ten rows and two columns you’re really interested in:

SELECT

  client_id,

  SUM(transaction_value) total_spend

FROM

  client_transactions

GROUP BY

  client_id

ORDER BY

  total_spend DESC

LIMIT 10

 

As the table you’re accessing grows, the benefits of doing the work in the data warehouse will too.

I recall working with a data scientist, fresh out of university, who did not want to learn SQL and took the former approach. It worked well enough while the product was in development and soft launch. Three months after public launch, they were trying to download tens of gigabytes of data from the data warehouse. We ran out of hardware and networking options to keep the complex analytics pipeline they had built on their laptop running and it ground to an undignified halt.

Often, some of the work has to be done in your code outside the data warehouse. These tips encourage you to do as much of the work as you can in your data warehouse before you pull just what you need out to process in your code. Modern data warehouses are capable of doing more right there in the warehouse, including building machine learning models.

We’ll focus on SQL for the next few tips as it’s so prevalent, but similar principles and capabilities are often available in whatever data storage and parallel processing capability you are using. The tips apply in principle to any query language, so you may be able to apply some of them when working with systems like Neo4J or MongoDB. I’ll typically refer to Google’s BigQuery data warehouse, as I find it has excellent SQL feature coverage and the documentation is clear and well-written.

Tip 1: Learn and use the query language

SQL has been around for a long time – even older than I have (just). The foundations were laid back in 1970. It’s an enormously expressive and powerful language. You can probably do more than you think with it, and making the effort to try and push as much work as you can into the database, building your SQL knowledge as you go, will pay dividends quickly. You get the added bonus that non-data scientists could understand and be able to help with your queries, as SQL is commonly used by other data folk that perform analytics and reporting functions.

There’s a lot more power and flexibility in modern SQL than the basic SELECTWHERE, and GROUP BY would suggest. Every database has a query optimiser that looks at your query and statistics about the tables involved before selecting a strategy to answer most efficiently. Self-joins, nested columns, STRUCT types, user-defined functions and window or analytic functions are a few capabilities common to many data warehouses that I’ve found very helpful in expressively solving trickier problems, and they are less-well known amongst data scientists.

I found the first couple of paragraphs in BigQuery’s window function documentation provided a really useful mental model of what they are, how they differ from normal aggregations and when they might be useful.

Once you start using SQL, you’ll find yourself writing larger and more complex queries. Let’s talk about how you can control that complexity and keep your queries manageable.

Tip 2: Use common table expressions

You can think of Common Table Expressions (CTEs) as virtual tables within your query. As your simple query becomes more complex, you might be tempted to bolt together multiple sub-queries like this:

SELECT a, b, c

FROM (SELECT ...)

INNER JOIN (SELECT ...)

 

This will work, and I’ve seen plenty of queries done this way. They can easily grow into SQL behemoths hundreds of lines long that are really difficult to understand and reason about. The sub-queries are a big part of the problem, nesting complex logic inside other queries. Instead of creating a large, monolithic query, you can pull out those complex queries into CTEs and give them meaningful names.

WITH

    clients AS (SELECT ....),

    transactions AS (SELECT …)

SELECT a, b, c

FROM clients

INNER JOIN transactions

 

Now you can see what that query is all about. As well as making your queries easier to understand, breaking up a large query like this means you can query the CTEs to see what they are producing while you develop and debug your query. Here is a more in-depth tutorial on CTEs.

Equal Experts Associate Dawid du Toit highlights one thing to watch out for when using CTEs. If you use the CTE multiple times during your query (for example, a self-join from the CTE back onto itself) you may find the CTE evaluated twice. In this, materializing the data into a temporary table may be a better option if the cost of the query is a significant concern.

Tip 3: Use views

So you’ve started using CTEs and you’re writing much more expressive queries. Things can still become complicated, and it’s not always convenient to change your query to inspect your CTEs – for example, when your query is running out there in production and can’t be modified.

Once you’ve got a CTE, it’s straightforward to convert into a database view. A view is just a query you define in the database, so you can just take your CTE and wrap it in a CREATE VIEW statement. You can then use the view as if it were a table – which means you can query it and inspect the data it contains at any time without modifying your deployed queries.

Views are usually heavily optimised and don’t have to introduce any performance or cost penalty. For example, implementing “predicate pushdown” allows a data warehouse to optimise the data and computation the view uses based on how it is queried – as if the view was just part of the query.

Views are logical infrastructure, and lend themselves to deployment with an “infrastructure as code” mindset the same way as we might deploy a server or the database itself.

CREATE VIEW clients AS (SELECT ....);

CREATE VIEW transactions AS (SELECT …);

 

Then, we can use these views in queries, whether executed ad-hoc or as part of a data pipeline.

SELECT a, b, c

FROM clients

INNER JOIN transactions

 

Database views provide a much more flexible interface to underlying tables. The underlying data will eventually need to change. If consumers access it through views, the views can adapt to protect those consumers from the change, avoiding broken queries and irate consumers.

Views are also a great way to avoid data engineering complexity. I’ve seen plenty of complex data processing pipelines that populate tables with refined results from other tables, ready for further processing. In every case I can recall, a view is a viable alternative approach, and it avoids managing the data in intermediate tables. No complex data synchronisation like backfilling, scheduling or merging challenges here! If your data warehouse supports it, a materialized view can work where there are genuine reasons, such as performance or cost, to avoid directly querying the source table.

Tip 4: Use user defined functions

Databases have supported User Defined Functions (UDFs) for a long time. This useful capability allows you to capture logic right in the database, give it a meaningful name, and reuse it in different queries. A simple example of a UDF might parse a list from a comma-separated string, and return the first element. For example, given a function like this (using the BigQuery SQL dialect)

CREATE FUNCTION my_dataset.get_first_post(post_list STRING)

RETURNS STRING

AS (

  SPLIT(post_list, ',')[SAFE_OFFSET(0)]

)

 

Specify the return type of your query, even if it can be inferred. That way, you’ll know if you’re returning something totally different to what you expected.

We can use the function in other queries:

WITH post_lists AS (

  SELECT ‘a,b,c,d’ AS post_list

)

SELECT

  post_list,

  my_dataset.get_first_post(post_list) first_post

FROM post_lists

 

The result would be as follows:

post_list first_post
a,b,c,d a

 

The benefits might not be clear from this simple function, but I recently implemented a Levenshtein Distance function using BigQuery’s JavaScript UDF capability, allowing us to find likely typos in a large dataset in the data warehouse.

Dawid recalls seeing UDFs that take a parameter which then go and query another table. This can completely kill your query performance, for example, where you could just have just joined with the other table. Be wary of using UDFs in this way!

A benefit that might not be immediately obvious is that you can deploy a fix or improvement to your UDF and all queries using it pick up the change when they are next executed. Be aware that there may be little of the structural flexibility you might be used to in other programming languages. For example, you might not be able to “overload” a UDF to handle an additional argument, nor can you provide a default value to an argument.

These limitations can make handling change even harder than in other languages – but I’ve found it’s still a big improvement over copying and pasting.

Tip 5: Understand what’s efficient

Doing your number-crunching in your data warehouse is likely to be the most cost-efficient place to do it. After all, modern data warehouses are really just massively parallel compute clusters.

There are typically ways to make your processing more efficient – which always means cheaper and usually means faster. For example, a clickstream table might stretch back five years and be 150Tb in total size. Partitioning that table by the event date means that the data warehouse can just scan yesterday’s data rather than the whole table to give you a total of some metric for yesterday – so long as you construct your query to take advantage of it. In this case, it’s likely as straightforward as ensuring your query WHERE clause contains a simple date constraint on the partition column.

Understanding how your data warehouse is optimised (or how to optimise it, if you’re setting up the table), can make a real difference to your query performance and cost.

Writing better SQL, faster

As you gain competency in writing SQL (or any language), you’ll find your queries and programs becoming more complex. That will make tracking down problems more difficult. Here’s a few tips to help you find and squash your bugs more easily, be more confident that your code does what it’s supposed to, and avoid the problems in the first place!

Tip 6: Test your UDFs

You can quite easily test your UDFs. Here’s a cut-down version of the query I wrote to test that Levenshtein distance function I mentioned earlier:

--- create a virtual table of function inputs and expected outputs, via a CTE

WITH examples AS (

    SELECT 'kittens' AS term_1, 'mittens' AS term_2, 1 AS expected

    UNION ALL SELECT 'kittens', 'kittens', 0

)




--- create a new CTE that injects the actual outputs from the function

, test AS (

    SELECT

        *,

        my_dataset.levenshtein_distance(term_1, term_2) actual

    FROM examples

)




SELECT

    *,

    IF(TO_JSON_STRING(actual) = TO_JSON_STRING(expected), 'pass', 'fail') result

FROM tes


Results:

term_1 term_2 expected actual result
kittens mittens 1 1 pass
kittens kittens 0 0 pass

 

If the query produces any rows that do not contain ‘pass’ in the result column, we know that there was an issue – that the actual function output did not match our expectation.

Although not needed for these examples, I typically use a function like TO_JSON_STRING that “stringifies” the expected and actual content in a deterministic manner so that structured values and NULL are compared correctly. Be aware that the documentation for TO_JSON_STRING does not guarantee deterministic order of keys in STRUCTs, but I’ve seen no problems in practice.

Tests have one more huge benefit that might not be immediately obvious. When you do data science for an organisation, it’s likely that someone else will one day need to understand your work. They might even need to make changes. Having some runnable tests means that both of those things will be easier, and that person will be much more likely to think fondly of you.

Tip 7: Test your views

Your views will depend on tables, and as you use these techniques you’ll find that you end with views that depend on other views. It’s handy to know that your views actually do what they are supposed to. You can test views in a similar fashion to UDFs, but it is a little more involved because unlike a UDF, a view must explicitly refer to the tables or views that provide the data it uses.

I have rolled my own testing solutions for these networks of views before. You can create test data, for example in the form of CSV files, that you upload to temporary tables. You can then create a temporary version of your views that use these tables with known, fixed content. The same testing approach outlined for UDFs can now check that your views produce the right data, for the same benefits!

A templating engine like Jinja templates and envsubst can provide a relatively straightforward way to change values in queries and views that vary between environments using environment variables. A query containing {{my_dataset}}.table can be adjusted to test.my_table in the test environment and prod.my_table in production.

Tip 8: Use dbt

I’ll also mention some tooling here. We’ve found dbt quite effective for managing networks of views, like those described above. It directly supports flexible references for source views and tables, provides built-in data and view testing capabilities and can even generate documentation, as in this example. Dataform is an alternative providing similar capabilities. Given that it’s recently been acquired by Google, it’ll likely be making an appearance as part of GCP soon, (thanks to Equal Experts Associate Gareth Rowlands for pointing this out!)

Data Studio team member Cláudio Diniz has also been using dbt. You can find more details in his post and the related example Github repo

Tip 9: Your time is valuable

Organisations often promote directives to save costs wherever possible. It’s not the first time I’ve come across posters, screens and mailshots for employees that remind staff to think about how much things cost and save every penny they can. In principle, that’s a great directive. Unfortunately, it can create focus on small costs that are easy to quantify at the expense of larger costs that are more obscure.

It’s great to create focus on a query that would process 10Tb but could execute for 1/1000th that cost with the addition of a WHERE clause that takes advantage of partitioning. Why not save easy money?

On the other hand, I’ve seen occasions when hours or days of work are invested to make a query more efficient, perhaps saving 2p per day at the expense of – let’s be conservative – two days of a data scientists’ time. At an average salary of £50k, that means we’re looking at a £400 investment to save £7.30 per year – and we’re ignoring the opportunity cost that your time could be spent working on something else.

This is a naive estimate. How much you cost your company is complicated. I’m using the gross salary as an estimate of employee cost. I think the true cost of a typical employee to a typical company is still likely to be higher than the gross salary after factors like tax breaks and equipment costs are taken into account.

Don’t forget to consider what your time is worth!

Bonus Tip: Google it!

The last tip I’d share with you is to check for existing solutions before you build it yourself. The odds are that the problem you are thinking about isn’t novel – others have likely come across that problem before. If you’re lucky, you’ll be able to understand and use their solution directly. If not, you’ll learn about someone else’s experience and you can use that to inform your solution.

Reading about a solution that didn’t work is a lot quicker than inventing it and discovering that it doesn’t work yourself.

The end…?

If you got this far, thanks for reading this post! I hope some of the tips here will be useful.

There’s more to productive data science than effective use of your data warehouse, so watch this space for more hints and tips in future.

Knowing, understanding and managing your data throughout its lifecycle is more important than it has ever been. And more difficult. 

Of course, the never ending growth in data volume is partly responsible for this, as are also countless processes that need to be applied to the data to ensure it is usable and effective. Which is why data analysts and data engineers turn to data pipelining.

Added complexity is involved when, In order to keep abreast of the latest requirements, organisations need to constantly deploy new data technologies alongside legacy infrastructure. 

All three of these elements mean that, inevitably, data pipelines are becoming more complicated as they grow. In the final article in our data pipeline series, we have highlighted some of the common pitfalls that we have learned from our experience over the years and how to avoid them. These are also part of our Data Pipeline Playbook.

About this series

This is the final post in our six part series on the data pipeline, taken from our latest playbook. Now we look at the many pitfalls you can encounter in a data pipeline project. In the series before now, we looked at what a data pipeline is and who it is used by. Next we looked at the six main benefits of a good data pipeline, part three considered the ‘must have’ key principles of data pipeline projects, and part four and five covered the essential practices of a data pipeline. So here’s our list of some of the pitfalls we’ve experienced when building data pipelines in partnership with various clients. We’d encourage you to avoid the scenarios listed below.

Avoid tightly coupling your analytics pipelines with other business processes

Analytics data pipelines provide data to produce insights about your customers, business operations, technology performance, and more. For example, the role of a data warehouse is to create an historical record of data that can be mined for insights.

It is tempting to see these rich data sources as the best source of data for all data processing and plumb key business activities in these repositories. However, this can easily end up preventing the extraction of insights it was implemented for. Data warehouses can become so integrated into business operations – effectively acting as the Operational Data Store (ODS) – that they can no longer function as a data warehouse. Key business activities end up dependent on the fast processing of data drawn from the data warehouse, which prevents other users from running queries on the data they need for their analyses.

Modern architectures utilise a micro-service architecture, and we advocate this digital platform approach to delivering IT functionality (see our Digital Platform Playbook). Micro-services should own their own data – and as there is unlikely to be a one-size-fits-all solution to volumes, latencies, or use of master or reference data of the many critical business data flows implemented as micro-services. Great care should be taken as to which part of the analytics data pipelines they should be drawn from. The nearer the data they use is to the end users, the more constrained your data analytics pipeline will become over time, and the more restricted analytics users will become in what they can do.

If a micro-service is using a whole pipeline as part of its critical functionality, it is probably time to reproduce the pipeline as a micro-service in its own right, as the needs of the analytics users and the micro-service will diverge over time.

Include data users early on

We are sometimes asked if we can implement data pipelines without bothering data users. They are often very busy interfacing at senior levels, and as their work provides key inputs to critical business activities and decisions, it can be tempting to reduce the burden on them and think that you already understand their needs.

In our experience this is nearly always a mistake. Like any software development, understanding user needs as early as you can, and validating that understanding through the development, is much more likely to lead to a valued product. Data users almost always welcome a chance to talk about what data they want, what form they want it in, and how they want to access it. When it becomes available, they may well need some coaching on how to access it.

Keep unstructured raw inputs separate from processed data

In pipelines where the raw data is unstructured (e.g. documents or images), and the initial stages of the pipeline extract data from it, such as entities (names, dates, phone numbers, etc.), or categorisations, it can be tempting to keep the raw data together with the extracted information. This is usually a mistake. Unstructured data is always of a much higher volume, and keeping it together with extracted data will almost certainly lead to difficulties in processing or searching the useful, structured data later on. Keep the unstructured data in separate storage (e.g., different buckets), and store links to it instead.

We hope that this article, along with all the others in the series, will help you create better pipelines and address the common challenges that can occur when building and using them. Data pipeline projects can be challenging and complicated, but done correctly they securely gather information and allow you to make valuable decisions quickly and effectively. 

Contact us!

For more information on data pipelines in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of data pipelines with you, get in touch using the form below.

In my opinion, ‘data is the new oil’ is a metaphor that should be used with caution, especially by those who wish to portray data in a positive light.

That is, whilst there are many similarities between data and oil, most are unflattering. I believe that by confronting these negative connotations, we can have the right conversations about our responsibilities in the Age of Data, whilst finding better metaphors to describe them.

A Brief History 

In the beginning was Clive Humby, the British data scientist, who coined the phrase ‘data is the new oil’ back in 2006. It has since become part of the business and management lexicon, repeated by journalists, policy makers and world leaders alike. In common usage, the metaphor emphasises the fact that oil and data are critical parts of the modern global economy, with the latter gradually replacing the former. Humby also recognised that data, like oil, has no intrinsic value, and expensive processes of refinement have to be applied before they become valuable. 

Certainly, data powers much of the economy, just as oil powers our engines. Much of what we do online is part of a Faustian pact, in which we allow the tech giants to harvest our data in exchange for useful, free tools such as email. Tech evangelists minimise the costs whilst emphasising the benefits of data in our lives. But if we ever stopped to actually consider how much personal information we give away each day we’d put our laptops in the freezer. And comparing data to oil has a dark side. Oil is a dirty business. Oil-based products – petrol, plastics, chemicals – are harming the planet. Put simply, this isn’t the kind of company data should want to keep.

Oil Spills and Data Leaks

Let’s look at one of the most regrettable similarities between data and oil.

As oil moves around the globe, leaks happen (there have been 466 large oil spills in the last 50 years). Much has been said and written about these disasters and the environmental damage they cause. Coupled with the growing apprehension of the role oil plays in the global climate crisis, you might expect the demand for oil to be falling like a stone. You’d be wrong. 

Figure 1: Global Oil Production 1999-2020 

And if that graph surprises you, consider this:

Figure 2: Number of monthly active Facebook users worldwide 

If you’re looking for indicators of decline following Facebook’s equivalent to Exxon Valdez – the Cambridge Analytica scandal – you won’t find it. Lest we forget, Cambridge Analytica harvested upwards of 87 million Facebook users’ personal data without their consent, then sold that data to political consultancies. This dubious practice may well have affected the outcome of the 2016 US Presidential election, and the Brexit vote in the UK the same year. But despite #deletefacebook and some social and political huffing at the time, the scandal didn’t make a dent on Facebook’s fortunes.

So tech giants and oil barons are alike, in that they leak and pollute and behave with disregard for the wider community, without much consequence. 

Oil and Water

The question then becomes, is there a better metaphor out there? During my research I’ve happened across plausible arguments in favour of a cataclysmic comparison – that is to say, data is the new nuclear power (awesomely powerful, yet capable of dreadful contamination and destruction). When discussing this piece with a leading practitioner, he reminded me that data ‘flows’ from one place to another, and suggested that it’s like water (it’s nourishing and necessary – but needs filtering and processing to be safe; it can leak), or slightly less appetisingly, data is like blood.

All are decent metaphors (I particularly like the ‘water’ alternative). However, water (like uranium, or blood) is physical – if I buy and drink a litre of water, no one else can drink that same litre – whereas data can be used simultaneously in different places, at multiple times in multiple ways. And data is unique, whereas one glass of water is essentially the same as any other. 

If we stick with data being like oil, we’re left with harrowing images of sick seabirds and bleached reefs. Which prompts me to ask: are we in danger of losing something valuable, by tarring data with the same oily brush?

Data for Good

Last year, academics at the University of Oxford interrogated a massive dataset to assess the effectiveness of a range of potential treatments for Covid-19. Using advanced data science techniques, they discovered an unexpected pattern – namely, a drug used in the treatment of rheumatoid arthritis could save lives, reduce the need for a ventilator, and shorten patients’ stay in hospital. Such a breakthrough should be seen as an unalloyed success story for all those involved, whilst also containing within it some valuable lessons about how we treat data.

The most important, from my perspective, is that the data sets were held securely by NHS Digital, after full consent was granted by those involved. Not one item of data was taken without express permission, or used for any other purpose than that for which the data was originally sought. In other words, the data was willingly and knowingly given for a specific and transparent purpose. Safeguards were put in place, adhered to, and all parties acted responsibly throughout. Why can’t all data be used in this way?

Data Guardianship

Ultimately, the NHS Digital story, and others like it, reinforce the importance of the concept of ‘Data Guardianship’. That is, all actors in our data-rich economy need to take responsibility for minimising the damage their actions cause in the present, whilst making every reasonable effort to safeguard the future. The three pillars of Data Guardianship are:

  1. Organisations shouldn’t gather any data that might expose the subject to excessive privacy risks, now or in the future
  2. Data should not be hoarded ‘just in case’ – organisations should refuse to keep anything they don’t need
  3. Organisations should be proactive in explaining what data they’re collecting, how they intend to use it, and what rights the data subject has, in order to enable better decisions around consent

Ultimately, we have to make sure data doesn’t become the new oil, and instead find a metaphor that emphasises the positive values that underpin these pillars, instead of contradicting them. We can’t simply hope that some future phenomenon will make our data safe from abuse – we all need to educate ourselves, and then act accordingly, today. And if we can’t trust companies to behave responsibly, we shouldn’t give them our data in the first place. 

Perhaps we should think of our data as a vote that we cast in support of those organisations that are behaving best in the data-based economy. In fact, maybe that’s the new metaphor I’ve been searching for all along: data is the new democracy. 

 

 

It is common to hear that ‘data is the new oil,’ and whether you agree or not, there is certainly a lot of untapped value in much of the data that organisations hold.

Data is like oil in another way – it flows through pipelines. A data pipeline ensures the efficient flow of data from one location to the other. A good pipeline allows your organisation to integrate new data sources faster, provide patterns that you can replicate, gives you confidence in your data quality, and builds in security. But, data flow can be precarious and, when not given the correct attention, it can quickly overwhelm your organisation. Data can leak, become corrupted, and hit bottlenecks and, as the complexity of the requirements grow, and the number of data sources multiplies, these problems increase in scale and impact.

About this series

This is part one in our six part series on the data pipeline, taken from our latest playbook. Here we look at the very basics – what is a data pipeline and who is it used by? Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part two, we look at the six main benefits of a good data pipeline, part three considers the ‘must have’ key principles of data pipeline projects, and parts four and five cover the essential practices of a data pipeline. Finally, in part six we look at the many pitfalls you can encounter in a data pipeline project. 

Why is a data pipeline critical to your organisation?

There is a lot of untapped value in the data that your organisation holds. Data that is critical if you take data analysis seriously. Put to good use, data can identify valuable business insights on your customers and your operations. However, to find these insights, the data has to be regularly, or even continuously, transported from the place where it is generated to a place where it can be analysed.

A data pipeline, consolidates data from all your disparate sources into one (or multiple) destinations, to enable quick data analysis. It also ensures consistent data quality, which is absolutely crucial for reliable business insights. 

So what is a data pipeline?

A data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. We like to think of this transportation as a pipeline because data goes in at one end and comes out at another location (or several others). The volume and speed of the data are limited by the type of pipe you are using and pipes can leak – meaning you can lose data if you don’t take care of them.

The data engineers who create a pipeline are a critical service for any organisation. They create the architectures that allow the data to flow to the data scientists and business intelligence teams, who generate the insight that leads to business value.

A data pipeline is created for data analytics purposes and has:

Data sources – These can be internal or external and may be structured (e.g., the result of a database call), semi-structured (e.g., a CSV file or a Google Sheets file), or unstructured (e.g., text documents or images).

Ingestion process – This is the means by which data is moved from the source into the pipeline (e.g., API call, secure file transfer).

Transformations – In most cases, data needs to be transformed from the input format of the raw data, to the one in which it is stored. There may be several transformations in a pipeline.

Data quality/cleansing – Data is checked for quality at various points in the pipeline. Data quality will typically include at least validation of data types and format, as well as conforming with the master data.

Enrichment – Data items may be enriched by adding additional fields, such as reference data.

Storage – Data is stored at various points in the pipeline, usually at least the landing zone and a structured store (such as a data warehouse).

End users – more information on this is in the next section.

So, who uses a data pipeline?

We believe that, as in any software development project, a pipeline will only be successful if you understand the needs of the users. 

Not everyone uses data in the same way. For a data pipeline, the users are typically:

Business intelligence/management information analysts, who need data to create reports; 

Data scientists who need data to do an in-depth analysis of point problems or create algorithms for key business processes (we use ‘data scientist’ in the broadest sense, including credit risk analysts, website analytics experts, etc.)

Process owners, who need to monitor how their processes are performing and troubleshoot when there are problems.

Data users are skilled at visualising and telling stories with data, identifying patterns, or understanding significance in data. Often they have strong statistical or mathematical backgrounds. And, in most cases, they are accustomed to having data provided in a structured form – ideally denormalised – so that it is easy to understand the meaning of an individual row of data without the need to query separate tables or databases.

Is a data pipeline a platform?

Every organisation would benefit from a place where they can collect and analyse data from different parts of the business. Historically, this has often been met by a data platform, a centralised data store where useful data is collected and made available to approved people. 

But, whether they like it or not, most organisations are, in fact, a dynamic mesh of data connections which need to be continually maintained and updated. Following a single platform pattern often leads to a central data engineering team tasked with implementing data flows. 

The complexities of meeting everyone’s needs and ensuring appropriate information governance, as well as a lack of self-service, often make it hard to ingest new data sources. This can then lead to backlog buildup, frustrated data users, and frustrated data engineers. 

Thinking of these dataflows as a pipeline changes the mindset away from monolithic solutions, to a more decentralised way of thinking – understanding what pipes and data stores you need and implementing them the right way for that case whilst reusing where appropriate.

So now we have understood a little more about the data pipeline, what it is and how it works, we can start to understand the benefits and assess whether they align with your digital strategy.  We cover these in the next blog article, ‘What are the benefits of data pipelines?’

For more information on the data pipeline in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of the data pipeline with you, get in touch using the form below.

 

 

If you’re a senior IT leader,  I’d like to make a prediction. You have faced a key data governance challenge at some time. Probably quite recently. In fact, there is a good chance that you’re facing one right now. I know this to be true, because clients approach us frequently with this exact issue. 

However, it’s not a single issue. In fact, over time we have come to realise that data is a slippery term that means different things for different people. Which is why we felt that deeper investigation into the subject was needed, to gain clarity and understanding around this overloaded term and to establish how we can talk to clients who see data governance as a challenge. 

So, what is data governance? And what motivates an organisation to be interested in it?

Through a series of surveys, discussions and our own experiences, we have come to the conclusion that client interest in data governance is motivated by the following wide range of reasons.

1. Data Security/Privacy

I want to be confident that I know the right measures are in place to secure my data assets and that we have the right protections in place.

2. Compliance – To meet industry requirements

I have specific regulations to meet (e.g. health, insurance, finance) such as:

  • Storage – I need to store specific data items for specified periods of time (or I can only store for specific periods of time).
  • Audit – I need to provide access to specified data for audit purposes.
  • Data lineage/traceability – I have to be able to show where my data came from or why a decision was reached.
  • Non-repudiation – I have to be able to demonstrate that the data has not been tampered with.

3. Data quality

My data is often of poor quality, it is missing data points, the values are often wrong, or out of date and now no-one trusts it. This is often seen in the context of central data teams charged with providing data to business functions such as operations, marketing etc. Sometimes data stewardship is mentioned as a means of addressing this.

4. Master/Reference Data Management

When I look at data about the same entities in different systems I get different answers.

5. Preparing my data for AI and automation

I am using machine learning and/or AI and I need to know why decisions are being made (as regulations around the use of AI and ML mature this is becoming more pressing – see for example https://ico.org.uk/for-organisations/guide-to-data-protection/key-data-protection-themes/explaining-decisions-made-with-ai/).

6. Data Access/Discovery

I want to make it easier for people to find data or re-use data – it’s difficult for our people to find and/or access data which would improve our business. I want to overcome my data silos. I want data consumers to be able to query data catalogues to find what they need.

7. Data Management

I want to know what data we have e.g. by compiling data dictionaries. I want more consistency about how we name data items. I want to employ schema management and versioning.

8. Data Strategy

I want to know what strategy I should take so my organisation can make better decisions using data. And how do I quantify the benefits?

9. Creating a data-driven organisation

I want to create an operating model so that my business can manage and gain value from its data.

I think it’s clear from this that there are many concerns covered by the term data governance. You probably recognise one, or maybe even several, as your own. So what do you need to do to overcome these? Well, now we understand the variety of concerns, we can start to address the approach to a solution. 

Understanding Lean Data Governance

Whilst it can be tempting for clients to look for an off-the-shelf solution to meet their needs, in reality, they are too varied to be met by a single product. Especially as many of the concerns are integral to the data architecture. Take data lineage and quality as examples that need to be considered as you implement your data pipelines – you can’t easily bolt them on as an afterthought.

Here at Equal Experts, we advocate taking a lean approach to data governance – identify what you are trying to achieve and implement the measures needed to meet them. 

The truth is, a large proportion of the concerns raised above can be met by following good practices when constructing and operating data architectures – the sorts of practices that are outlined in our Data Pipeline and Secure Delivery playbooks.  

We have found that good data governance emerges by applying these practices as part of delivery. For example:

  • Most Data security concerns can be met by proven approaches – taking care during environment provisioning, implementing role-based access control, implementing access monitoring and alerts and following the principles that security is continuous and collaborative.
  • Many Data Quality issues can be addressed by implementing the right measures in your data pipelines – incorporating observability through the pipelines – enabling you to detect when changes happen in data flows; and/or pragmatically applying master and reference data so that there is consistency in data outputs. 
  • Challenges with data access and overcoming data silos are improved by constructing data pipelines with an architecture that supports wider access. For example our reference architecture includes data warehouses for storing curated data as well as landing zones which can be opened up to enable self-service for power data users. Many data warehouses include data cataloguing or data discovery tools to improve sharing.
  • Compliance challenges are often primarily about data access and security (which we have just addressed above) or data retention which depends on your pipelines. 

Of course, it is important that implementing these practices is given sufficient priority during the delivery. And it is critical that product owners and delivery leads ensure that they remain in focus. The tasks that lead to good Data Governance can get lost when faced with excessive demands for additional user features. In our experience this is a mistake, as deprioritising governance activities will lead to drops in data quality, resulting in a loss of trust in the data and in the end will significantly affect the user experience.

Is Data Governance the same as Information Governance?

Sometimes we also hear the term Information Governance. Information Governance usually refers to the legal framework around data. It defines what data needs to be protected and any processes (e.g. data audits), compliance activities or organisational structures that need to be in place. GDPR is an Information Government requirement – it specifies what everyone’s legal obligations are in respect of the data they hold, but it does not specify how to meet those obligations. Equal Experts does not create information governance policies, although we work with client information governance teams to design and implement the means to meet them.

The field of data governance is inherently complex, but I hope through this article you’ve been able to glean insights and understand some of the core tenets driving our approach. 

These insights and much more are in our Data Pipeline and Secure Delivery playbooks. And, of course, we are keen to hear what you think Data Governance means. So please feel free to get in touch with your questions, comments or additions on the form below.