Unit testing in dbt – from an experiment to an open-source framework

In this article, I want to share our journey to introduce unit testing to dbt. There were a couple of existing efforts in the community, but none matched what we envisioned – writing unit tests in SQL with a feedback loop fast enough that we could even use it for test-driven development (TDD).

My colleague Pedro Sousa and I have published a couple of articles about our journey – we described our first experiment and then shared our second, more polished approach. After those posts, a couple of teams at Equal Experts started to use our strategy and give us feedback.

As mentioned in one of the articles, we always thought dbt should have support for unit tests. We asked dbt’s team about the roadmap and learned that built-in support was unlikely to happen; they think it makes more sense as an external framework. Personally, that doesn’t make much sense to me. One can argue that in programming languages we are used to having testing capabilities as external libraries, but dbt is not a programming language, and it already supports other types of tests.

After a couple of conversations with the other teams, we were encouraged to use our work to create the dbt-unit-testing framework under the Equal Experts GitHub.

We released the framework three months ago, and since then it has started to gain traction on GitHub. Currently, we have 47 stars and 45 closed pull requests, and we have approximately 120 unique visitors every 14 days. The best outcome is having people collaborating with us, giving feedback, creating issues and developing pull requests. We already have four community contributors, and we are grateful for all their work and effort – @halvorlu, @darist, @charleslr and @gnilrets.

Community collaboration and feedback are crucial to improve the framework and prioritise what should be done. We have a couple of ideas in the backlog, such as adding support for more data sources, but we don’t yet have a clear roadmap. We prefer to listen to the feedback and work based on that. Continuous improvement through continuous user feedback perfectly describes our mindset.

This post shares our journey, mindset and appreciation for the open-source community engagement in such small projects.

You can check the framework here: https://github.com/EqualExperts/dbt-unit-testing
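
To make the idea of a fast-feedback SQL unit test concrete, here is a minimal sketch in Python. It is not the dbt-unit-testing API: the `run_model` helper, the `orders` table and the `CUSTOMER_TOTALS_SQL` query are all hypothetical, and an in-memory SQLite database stands in for the warehouse. The pattern is the same, though: replace the model's inputs with a few mock rows, run the SQL, and compare the result with the expected rows.

```python
import sqlite3

# Hypothetical transformation under test: total order amount per customer.
CUSTOMER_TOTALS_SQL = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""


def run_model(sql, mocks):
    """Run `sql` against an in-memory database seeded only with mock rows.

    `mocks` maps a table name to (column names, list of row tuples).
    """
    conn = sqlite3.connect(":memory:")
    try:
        for table, (columns, rows) in mocks.items():
            conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
            placeholders = ", ".join("?" for _ in columns)
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def test_customer_totals():
    # Mock input: two orders for customer 1, one order for customer 2.
    mocks = {"orders": (["customer_id", "amount"], [(1, 10), (1, 5), (2, 7)])}
    assert run_model(CUSTOMER_TOTALS_SQL, mocks) == [(1, 15), (2, 7)]


test_customer_totals()
```

Because nothing outside the mock rows is read, a test like this runs in milliseconds, which is what makes a test-first workflow practical.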

Contributing to tech communities is very much part of our mission at Equal Experts.

 

I’ve been interviewing our data engineers to understand how data engineering is seen these days, to find how people become data engineers, and to find how data engineering overlaps with software engineering.

I want to start by thanking all the data engineers for giving their time to do this with me. During the series, I was able to mature my own ideas about data engineering by learning from others’ experiences. I believe this series gives us a more common and transparent view of data engineering. Also, I really hope I have encouraged more software engineers to join the data field.

For this final post, I’m sharing my own takeaways from the series.

Data engineering is a specialism

One aspect I want to recall is that there are two kinds of data – application data and analytical data. Application data is the data that allows a business to run, and analytical data is the data that optimizes a business – when we are speaking about data engineering we are mostly thinking about analytical data.

Two major factors make data engineering a separate and specialised silo in organisations. First, analytical data is often only taken into consideration far down the road. Second, the learning curve of some of the available data tools also contributes to the specialism; for instance, when companies started to use big data processing tools like Hadoop, it was a whole new, complex ecosystem.

However, I do believe that organisations are starting to see data as a crucial part of the business and a first-class concern, and for this to be effective, data work should be considered at the application level. Data mesh was referenced by multiple interviewees. It’s an emerging architecture that is starting to change the way organisations think about and work with data. Making analytical data a first-class concern means that the engineers who work at the application level will also need to be aware of analytical data. The adoption of cross-functional teams, expanding data literacy for non-technical roles like product owners, and the evolution of data tools are fundamental for this to happen.

Alongside this evolution, we might start to see data engineering generalise into the other engineering roles, if the data tooling also evolves and the learning curve flattens. For now, though, the industry still considers data engineering a specialism.

The journey from software engineer to data engineer

Apart from cases where people got exposure to data during their academic careers, the majority of the engineers that I interviewed shared that they were software engineers and they started to work on a data project by preference, by chance, or by accident.

I would say the journey isn’t effortless: the data landscape can be overwhelming and it’s a different context. But as we’ve seen during the series, the engineering skills are the same.

We’ve seen SQL mastery recommended as one of the skills that every data engineer should have, and I fully agree, because cloud data warehouses are being used more than ever for data workloads.

One of the interviewees mentioned that concepts might be more important than skills and I absolutely agree. So here is a list of concepts I consider fundamental:

  • Understanding storage formats, and the applicability and particularities of each type
  • Data processing at scale
  • Stream processing
  • Understanding GDPR, data security, and privacy
  • Awareness of the data landscape and capabilities of cloud providers
  • Adaptability to a new environment and mindset, not just from a technology perspective but from a business one too

I said that these concepts are fundamental, but that doesn’t mean an engineer needs all of them before joining a data engagement. Please keep in mind that no one knows everything, and learning on the job happens often and gracefully.

Lack of standards in data

With the massive growth of data, data tooling has been evolving at a fast pace. In 2004 Google introduced MapReduce as a programming model to handle large amounts of data; the Hadoop ecosystem followed, became mainstream, and was slightly overused.

Coexisting with the batch world is the streaming world, which answers questions with low latency, in near real time; this is often called real-time analytics. At some point there was an explosion of interest in streaming, and organisations started to adopt and overuse it, sometimes when there was no real gain in having real-time analytics.

The evidence is that technology often evolves in ways that let companies scale and become efficiently data-driven, but at other times unnecessary complexity is introduced without a proper use case or a real need. A good example is the data lake architecture: when it became mainstream, some data lakes ended up as data swamps – a dumping ground for data, but without the tools to use it.

We’re starting to see another pattern around the use of cloud data warehouses like BigQuery or Snowflake, which were referenced in multiple interviews. These modern data warehouses are scalable, cost-effective and easy to get started with, and they are being applied to more use cases than ever. This is reflected in the data landscape by the proliferation of tools that connect to the data warehouses to manage data. You might see this referred to as the Modern Data Stack, which is a set of tools and strategies to manage data using the cloud data warehouse as the central place to store and process the data.

The problem with the Modern Data Stack is that it’s not a well-defined stack; it’s just a term used to refer to data tools in this space, and there are a lot of them. That said, we’re seeing organisations slowly converge on specific tools. For example, dbt (data build tool) is becoming a common pattern for handling data transformations using SQL on top of the data warehouse.

That being said, the data space is evolving at a fast pace to cope with the data growth and it makes organisations try different strategies, sometimes driven by value, other times driven by technology. With the emergence of cloud data warehouses we are starting to see patterns, strategies, and tools that allow us to hopefully have a slightly less complex data world.

After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.

Based on this, I decided to create a blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering, and it might encourage more software engineers to become one.

This week the interview is with Yogendra Valani.

What is data engineering for you and how does it overlap with software engineering?

Of the data engineering projects I have worked on, I’ve seen a common fundamental requirement to provide reporting and data-driven insight that illustrates user behaviour and the impact on commercial goals. This is often an afterthought for the very teams that produce features or products. The shift towards using microservices has meant that data is now stored in various systems and formats. The main challenge data engineers face is collating multiple sources of data into a single place (i.e. data warehouse, data lake), to easily query and cross-reference data.

Traditionally, many of the problems solved by data engineers required being creative when working with limited server processing power and capacity. Due to this, most senior data engineering roles required experience maintaining database servers, optimising SQL scripts, and building search optimised indexes. A finely balanced trade-off had to be made between granularity and aggregation. Changing this trade-off required refactoring tightly coupled scripts, with multiple dependencies and backfilling data, which could be long and expensive.

The introduction of cloud-based solutions such as Google BigQuery or Amazon Athena has enabled a new data engineering paradigm known as ELT (Extract, Load, and Transform), as opposed to ETL (Extract, Transform and Load). With the new tools, source systems are now copied in their entirety, helping data analysts and scientists work with raw data. Data structures are much easier to change, whilst the need to backfill from source systems is eliminated.
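
A minimal sketch of the ELT idea, using an in-memory SQLite database as a stand-in for a cloud warehouse. The `raw_orders` table, its columns and the `customer_revenue` view are hypothetical names chosen for illustration. The source rows are loaded untouched, and the transformation is a view on top of them, so changing the business rule only means redefining the view; nothing has to be re-extracted from the source system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: copy the source rows as they are, with no reshaping.
conn.execute(
    "CREATE TABLE raw_orders "
    "(order_id INTEGER, customer_id INTEGER, amount_pence INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [(1, 1, 1250, "delivered"), (2, 1, 800, "cancelled"), (3, 2, 430, "delivered")],
)

# Transform (first version): revenue per customer, computed inside the warehouse.
conn.execute("""
    CREATE VIEW customer_revenue AS
    SELECT customer_id, SUM(amount_pence) AS revenue_pence
    FROM raw_orders
    GROUP BY customer_id
""")
revenue_v1 = conn.execute(
    "SELECT * FROM customer_revenue ORDER BY customer_id"
).fetchall()

# The rule changes: cancelled orders should not count. Only the view is
# redefined; the raw data is still there, so there is nothing to backfill.
conn.execute("DROP VIEW customer_revenue")
conn.execute("""
    CREATE VIEW customer_revenue AS
    SELECT customer_id, SUM(amount_pence) AS revenue_pence
    FROM raw_orders
    WHERE status = 'delivered'
    GROUP BY customer_id
""")
revenue_v2 = conn.execute(
    "SELECT * FROM customer_revenue ORDER BY customer_id"
).fetchall()

print(revenue_v1)  # [(1, 2050), (2, 430)]
print(revenue_v2)  # [(1, 1250), (2, 430)]
```

In an ETL pipeline the `status` column might have been dropped before loading, and the second version would have required going back to the source system.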

The role of a data engineer is evolving to be closer to that of a software engineer. We see demand from our users to build tools to interrogate data, whilst also mimicking contemporary software engineering practices by including automated tests, CI/CD pipelines, alerts and monitoring. A great example of this is dbt (https://www.getdbt.com/), a tool used to make SQL scripts in the transformation stage smaller and easier to read, maintain and test.

How did you get involved in data engineering?

I have always wanted to combine my maths education with software development. After a hackathon project I joined the data engineering team at Just Eat. The team had started migrating from a Redshift database to Google BigQuery. They were overwhelmed by constant firefighting and considerable resistance from data analysts to migrating all the analysts’ reports to yet another system.

We changed the migration strategy from a big-bang switchover to working on a report-by-report basis, co-working with analysts to solve a complex reporting problem around delivery logistics. Trust in our data and platform grew, resulting in the onboarding of more users. Our backlog quickly changed from migrating a list of source system tables and associated reports to use-case-driven feature requests.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

Data isn’t always structured in formats that are easy to query. Most software engineers would find it easy to code, for example, a hierarchical structure that requires tree traversal. Trying to work with such a structure in SQL, perhaps with loops or recursive functions, requires a different thought process.
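
As a small illustration of that different thought process, the sketch below walks a hierarchy with a recursive common table expression instead of a loop. It assumes a hypothetical `categories` table with a `parent_id` column and uses Python's built-in SQLite driver; the exact recursive syntax varies between databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categories (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO categories VALUES (?, ?, ?)",
    [(1, None, "Food"), (2, 1, "Pizza"), (3, 1, "Sushi"), (4, 2, "Vegan pizza")],
)

# The first SELECT is the starting node; the second SELECT is applied
# repeatedly, joining each newly found row to its children, until no
# more rows are produced.
DESCENDANTS_SQL = """
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM categories WHERE id = ?
        UNION ALL
        SELECT c.id, c.name, s.depth + 1
        FROM categories AS c
        JOIN subtree AS s ON c.parent_id = s.id
    )
    SELECT name, depth FROM subtree ORDER BY depth, name
"""


def descendants(root_id):
    """Return (name, depth) for the node `root_id` and everything below it."""
    return conn.execute(DESCENDANTS_SQL, (root_id,)).fetchall()


print(descendants(1))
# [('Food', 0), ('Pizza', 1), ('Sushi', 1), ('Vegan pizza', 2)]
```

Instead of visiting one node at a time, the query describes the whole set of reachable rows and lets the database expand it level by level.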

Another key area to research and understand is the technical challenges and trade-offs of streaming versus batch processing. Batch processing tends to be much easier and cheaper than streaming, and the majority of requests only require a batch processing solution.

What data trends are you keeping an eye on?

As more software engineers move into data engineering, I’m looking for tools that improve the development experience. I have been working with dbt and Airflow Cloud Composer. One of the most exciting libraries I have seen is the unit test framework (created by EE developers) for dbt. This has been an absolute game-changer in terms of my development experience, as I have been able to use test-driven development to write SQL scripts!

Do you have any recommendations for software engineers who want to be data engineers?

Join a multidisciplinary team consisting of both software and data engineers. As database technology has evolved and many of the traditional approaches are no longer valid, it’s important to challenge the status quo.

 

This week the interview is with Thorben Louw.

What is data engineering for you and how does it overlap with software engineering?

I’d say that data engineering is a software engineering specialism which takes the best practices from ‘traditional’ software engineering – let’s call that building applications and systems – but focuses these practices on the data domain. I mean stuff like unit testing, automation, CI and CD, telemetry and operability, and even how we deliver things – iterative and incremental approaches and early customer involvement.

There’s definitely a lot of overlap. Software engineers have to deal with data all the time, and data engineers write software all the time, so it’s really a question of degree. Actually, I’m not entirely convinced that it’s helpful to try and make a really definite distinction. For a data engineer, the main difference is that the data is the end-product, and the software that we write is the stuff which moves, shapes and measures data – i.e. data pipelines.

The data that we work with – and the landscape in which it lives and is consumed – really controls the choices we have for implementation technologies and techniques. This can mean a more limited set of tools (platforms, libraries, languages) than might be available to a software engineer writing, say, a microservice or an app.

How did you get involved in data engineering?

My background is in traditional software engineering, but I’ve always had an interest in pattern recognition and machine learning. Over the last few years, I got involved in a few data-heavy projects that involved machine learning, which resulted in my focus shifting to data.

As part of making these machine learning projects work repeatedly at scale, I had to get involved in productionising data pipelines and automating things like data cleaning and preparing training data sets. So I was collaborating with other specialists like data scientists, ML and software engineers to make the right data be in the right place, in the right shape, at the right time, and suddenly found myself doing data engineering.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

There are definitely both skills and concept gaps between the people who specialise in making software that works at scale, the people who analyse data and build models, people who know how to visualise data, and the people who understand how to store, clean, move and prepare data. As a data engineer you’ll probably have to wear all of these hats some of the time, often being the collaborative glue between specialists. So, learning the vocabulary and a little bit of the tooling in each of these domains is useful – but I recognise that that’s also a daunting prospect! As a start, it’s useful to be comfortable with a tool like JupyterLab that data scientists and analysts use all the time, and perhaps also with a popular machine learning framework like TensorFlow.

A modern data engineer is likely to be writing more complex SQL queries than they might have been used to, and diagnosing performance problems with queries, views etc. This often comes as a surprise to people who think modern data engineering is all about writing Scala code for Spark. It’s worth learning SQL well, and also learning about modern data warehouses and column-store databases and how to leverage their power efficiently. You can often succinctly and trivially achieve things that would be a lot of effort to build yourself.

You might have to learn more about various data modelling strategies for OLAP and OLTP data use-cases. 

Learning more about the various kinds of data stores and data processing frameworks available from the various cloud providers is useful, if you haven’t had much reason to yet. Similarly, you might come across some more exotic data formats (like Avro or protobuf), and libraries for working with in-memory data, like Apache Arrow, or DLPack. But this can be different for every data source or project.

Then, there are popular tools and frameworks that classical software engineers might not have had exposure to. Off the top of my head, I can think of orchestration frameworks like Airflow/Prefect/Dagster, various ETL tools, and the trending ELT tools like dbt.

With all of these, I don’t think people should be put off trying out data engineering because they don’t know X! You learn what you need to as you need it.

I think the shift to thinking of data as a product takes a little getting used to. It’s not your code that’s precious – it’s the data it makes available to your end users.

Lastly, getting to know more about the ethics of handling data, and about legal requirements like GDPR, is a really good thing to do!

What data trends are you keeping an eye on?

I’m watching how the MLOps movement matures. A lot of people have seen great benefits from being able to extract insights from their data, but people embarking on machine learning projects often massively underestimate all the other plumbing work it takes to make things successful. And while the modelling tools and frameworks have now almost become commodified, the work needed to produce and deploy good models consistently and reproducibly is mostly still quite bespoke. This plumbing includes stuff like versioning training data, making data available efficiently and affordably (maybe to custom hardware), measuring data quality, optimising model training and selection, and CI/CD practices around machine learning. I’ve seen estimates that this stuff can be 95% of the effort of a machine learning project!

A particularly interesting thing in this area is the high-end projects that make use of dedicated machine learning hardware (like GPU clusters, Google TPUs, Graphcore IPUs, and systems from the likes of Cerebras, SambaNova and others). Optimising data movement to and from devices is critical and requires a deep understanding of the machine learning models and some understanding of hardware constraints (like memory, networking and disk bandwidth constraints, and new tools like compilers that optimise models for these platforms). If people continue to train larger and larger models, this specialist skill will become critical, but luckily tools for it are also improving very rapidly.

In-memory computing seems really exciting and might have a big impact on how we load and process data in future.

Another thing is that the vast majority of data available to us for analysis is still unstructured data, and I think tools and libraries for working efficiently with raw text, images, audio and video have come along so quickly in the last decade. It will be amazing to see what the future holds here.

Lastly, I’m quite excited by the emergent data mesh paradigm, which encourages the right kind of decentralisation so that teams structure themselves and their implementations in ways appropriate to their data products. I think it’s our best bet yet for dealing with the rapidly growing data teams and data engineering activities many organisations are starting to struggle with.

Do you have any recommendations for software engineers who want to be data engineers?

Firstly, if data fascinates you, go for it! There’s so much exciting stuff happening in this space, and it all changes pretty fast. So don’t be afraid of just starting – right now! – and learning as you go. That’s pretty much how everyone does it.

I think there’s some vocabulary and perhaps unfamiliar tooling, which can be overwhelming at first and make you feel like some sort of imposter. But, if you have a good heart and a curious mind, you will pick stuff up in no time. There are lots of great resources and awesome blogs and videos. 

Also be aware that, in data, there’s plenty of exciting and important stuff happening outside of machine learning and data science – those have just stolen the spotlight for now. Don’t ignore an opportunity because it doesn’t seem like you’ll be doing hip machine learning related stuff.

 

This week the interview is with Himanshu Agarwal.

What is data engineering for you and how does it overlap with software engineering?

I feel data engineering requires good software engineering skills in order to do things in the right way. The only thing that differs is that you need specialised data skills: you work on the data available, perform transformations, create insights, and make it available to data scientists and analysts in easy-to-use formats.

How did you get involved in data engineering?

It’s an interesting journey. In my previous gig, they had a data client and were looking for data engineers from India. It was difficult to get enough people within the short timespan and that’s where my software engineer to data engineering journey started, from learning the concepts of Hadoop, Sqoop and Hive to working on data pipelines. After that I joined Equal Experts and went back to doing Java work for a year, but then I thought why not try data engineering again and see if it excites me. That’s when I started looking around for opportunities in this space and approached the recruitment team where we had an open role with one of our clients. Everything went well and continues to do so.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

When I made the transition I thought it would be all about SQL, but I later realised that I use most of my software engineering skills while working on the development, building, packaging and deployment of the data pipelines and processes. That said, there are some specialised or advanced skills that you’ll learn when you enter the world of data. Some of them are:

SQL – you definitely need to be very good with advanced SQL, as it’s the backbone of data engineering and lets you do quick data analysis so you can get back to people within a very short time.

Terraform / IaC – most of us have worked with Infrastructure as Code in software projects, but here you might need to skill up when creating a data platform and integrating many different sources and sinks.

Data storage options and data processing – in the big data world, there are many ways to do the same thing, so you need to be aware of multiple tools and techniques and use the right approach for current requirements.

Data modelling – it plays an important role in the data engineering world.

Scaling – you process terabytes of data, so you always need to be on your toes and ask whether a solution is scalable and optimised to handle the huge amount of data being processed.

Also, as a data engineer, you need to be aware of streaming and batch processing concepts, and how to do each one effectively.

What data trends are you keeping an eye on?

After reading a few blogs from this series, data mesh was on everyone’s list, so I started reading about it and it looks like a new shift towards how data can be viewed as a product within each domain, handling their own data pipelines. So yes, this is something I’m looking at these days.

Also, the data space is continuously evolving, with new approaches, solutions and frameworks for improving processing, so I’m keeping a focus on how compute power can be utilised in a better way.

Do you have any recommendations for software engineers who want to be data engineers?

If you’re already working as a software engineer then don’t wait – just grab an opportunity to work with any data engineer and you should be able to make a mark with your engineering skills. Learn about data and its processing techniques on the go, as I have.

One more thing I should mention is that we still pair most of the time in data engineering work too.

This week the interview is with Austin Poulton.

What is data engineering for you and how does it overlap with software engineering?

They are really the same fundamental discipline in my view. The principles of functional decomposition, repeatability, testing, observability and monitoring apply to both software and data engineering. Data engineering is a specialisation where practitioners are often fluent in large data processing technologies, patterns and architectures over and above good software engineering practices. Data engineering is maturing to embody well-established software engineering practices. It’s not merely about wrangling data for ad-hoc analysis.

How did you get involved in data engineering?

In the early days of my career I worked on pricing and provisioning analysis for telco networks. We relied on lots of training data and simulating synthetic data. The need to have reproducible transformation pipelines even in analytical settings was essential. Later, my experience of trade processing for a risk engine at an investment bank honed concepts of stream processing, eventual consistency, denormalised representations, data provenance (lineage) and so on. I’ve always been deeply interested in how data is structured and modelled for analytical and decision-making applications.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

Data engineers need to be more deeply aware of how data should be organised for analytical and transformation workloads. Understanding how large datasets should be partitioned so that compute clusters can accurately and efficiently transform the data is critical.

SQL fluency is essential; it is the lingua franca not only of relational databases but of big-data technologies too.

Data quality is a massive concern for data-driven products. Understanding and working with tools that identify data issues, not only at the level of individual records but also in distributions over time, is really valuable.
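
The two levels of check can be sketched as follows. This is a simplified illustration rather than any particular data quality tool: the `amount` field, the validity rule and the three-standard-deviation tolerance are all assumptions made for the example.

```python
from statistics import mean, pstdev


def invalid_rows(rows):
    """Record-level check: return rows whose amount is missing or negative."""
    return [row for row in rows if row.get("amount") is None or row["amount"] < 0]


def has_drifted(history, today, tolerance=3.0):
    """Distribution-level check over time.

    True when today's metric is more than `tolerance` population standard
    deviations away from the mean of the historical values.
    """
    centre = mean(history)
    spread = pstdev(history)
    return abs(today - centre) > tolerance * spread


rows = [
    {"order_id": 1, "amount": 12.5},
    {"order_id": 2, "amount": -3.0},
    {"order_id": 3, "amount": None},
]
print([row["order_id"] for row in invalid_rows(rows)])  # [2, 3]

# Every row in today's load may be individually valid, yet a sudden drop
# in the daily row count still signals a problem upstream.
daily_row_counts = [100, 102, 98, 101, 99]
print(has_drifted(daily_row_counts, 60))   # True
print(has_drifted(daily_row_counts, 101))  # False
```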

What data trends are you keeping an eye on?

The data science and engineering space is evolving continuously. That said, there are mature approaches emerging from the ferment. Data mesh architecture aligns well with building data products and domain-driven design/organisation, as opposed to an analytics lake or platform. ML Ops tooling and patterns are crystallising such that models have a ready and repeatable path to production and are not consigned to static analysis in notebooks.

On the AI front, lots of interesting things are happening with natural language processing, such as the advent of GPT-3. We are generating so much data that there is a world of opportunity in using AI tooling for structuring, tagging and linking semi-structured and unstructured data.

Do you have any recommendations for software engineers who want to be data engineers?

The transition is easier than you think and it’s a really interesting specialisation! If you haven’t already, I suggest that you read Martin Kleppmann’s Designing Data-Intensive Applications, as it distils many of the problems and approaches you will likely encounter in your data engineering journey.

 

This week the interview is with Jennifer Stark.

What is data engineering for you and how does it overlap with software engineering?

Data science has evolved so much in the past several years. For me, data engineering is what was (and perhaps still is), considered the first 80% of data science – data sourcing, data evaluation/profiling, cleaning, normalising, and calculating or creating new fields. What’s different now, with the evolution of data engineering as its own distinct field, is that data is processed not for a specific end goal or question necessarily, but in a way where the output can serve several end users or stakeholders, and answer several questions. The output can be used more flexibly. This has been achieved in part with the productionisation of data engineering, formalising it using software engineering practices, working in teams rather than as siloed individuals, code reviews, pair coding, mobbing, and retrospectives, testing, continuous deployment and integration, reporting/monitoring. Software engineering principles have improved the broader world of data science, and I believe can go a long way in improving how data is handled in academia. 

How did you get involved in data engineering?

I wanted to leave academia, but wasn’t sure what I’d do instead. I knew I enjoyed several aspects of academia – research, experiment planning, analysis using R, data visualisation also using R, and creating presentations (I enjoy using reveal.js because you can build slide decks using HTML/CSS or Markdown, meaning you can embed custom animations, agent-based models, video etc). I was not so keen on writing papers or relying on external funding and having to move every two to three years.

I took a 9-5 research assistant position while completing a part time masters in information visualisation which consisted of coding, statistics, and graphic/web design, among other things. I wanted more coding and stats, and data science was just starting to take off, so I then did a part time bootcamp in data science with python. I really enjoyed that, and got a postdoc position in computational journalism for 18 months out of it and an article published in The Washington Post where I used my new python skills.

After that I explored data science roles in industry and got a role as a data engineer. It appears to be a rather common thing, where a company wants to become data-led and do data science, but they have no pipelines and their data is everywhere! My role was to establish some pipelines and then I’d become the data scientist, and the engineer role was to be backfilled. Unfortunately priorities changed and I moved to another company. I’ve now been a data engineer for three and a half years. 

What are the skills a data engineer must have that a software engineer usually doesn’t have?

I’ve never been a software engineer, so I might be wrong with some of these. The main one is familiarity with how data will be used, and with the impact that certain data cleaning or processing steps might have on how the consumer uses it. I try to consider how the data might be misused unintentionally, and what I can do as an engineer to mitigate that, including tests, documentation, data dictionaries or other supporting metadata.

For example, how best to deal with missing values might depend on what the data represents (discrete values, time series data, categories), how the API was designed, or on how the data will be or could be used. Is it best to fill the missing value with the average of the values on either side? Is filling a missing value with a null or a zero better? It all depends. Just being aware of these issues means that you can be proactive and seek advice from the domain experts for that particular data set – be it end users or the data providers – in order to select the right approach.
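To make the trade-off concrete, here is a minimal Python sketch of the three options mentioned above. The function name `fill_missing`, the strategy names and the sample readings are all made up for illustration; which strategy is right still depends on the data and on how it will be used.

```python
def fill_missing(values, strategy):
    """Return a copy of `values` with None entries handled by `strategy`.

    Hypothetical helper for illustration only. Strategies:
      "keep_null"      - leave missing values as None
      "zero"           - replace missing values with 0
      "neighbour_mean" - average the nearest known values on either side
    """
    if strategy not in {"keep_null", "zero", "neighbour_mean"}:
        raise ValueError(f"unknown strategy: {strategy}")

    filled = list(values)
    for i, value in enumerate(values):
        if value is not None or strategy == "keep_null":
            continue
        if strategy == "zero":
            filled[i] = 0
        else:
            # Look for the closest known value before and after the gap.
            before = next((v for v in reversed(values[:i]) if v is not None), None)
            after = next((v for v in values[i + 1:] if v is not None), None)
            # Only fill when both neighbours exist; otherwise keep the null.
            if before is not None and after is not None:
                filled[i] = (before + after) / 2
    return filled


readings = [10, None, 14, None, None, 20]  # invented sample data

with_nulls = fill_missing(readings, "keep_null")
with_zeros = fill_missing(readings, "zero")
with_means = fill_missing(readings, "neighbour_mean")

# The choice changes downstream statistics: zeros drag the average down,
# while nulls are simply excluded from it.
known = [v for v in with_nulls if v is not None]
average_ignoring_nulls = sum(known) / len(known)
average_with_zeros = sum(with_zeros) / len(with_zeros)
```

The two averages at the end differ by a factor of two for the same raw data, which is exactly the kind of unintentional misuse a consumer could run into if the filling strategy isn’t documented.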

What data trends are you keeping an eye on?

I’m always a bit cautious when something is “trending”, especially when something is presented as “the way we should be doing X now”, as I think it usually depends on the application context. It’s not one-size-fits-all.

Having said that, I am keeping an eye on MLOps, which is a facet of data engineering that is maturing into its own speciality. It’s a very fluid space at the moment, with the tech itself and the principles developing as we all try to figure out how to do it, which is quite exciting!

Do you have any recommendations for software engineers who want to be data engineers?

I think this recommendation is valid no matter what your background, but I’d say lean on your teammates and ask questions, sense check your ideas, etc. Also, I love mapping things out in Miro, but maybe that’s an answer to a different question. 

As someone who has hired data engineers at junior and mid-levels with job ads citing software engineer experience as relevant, I believe software engineers can move into an equivalent position level-wise (e.g. mid-level software engineer into a mid-level data engineer). As with any role, I’d look for a team that’s collaborative. In this way, you’d gain expertise in data engineering that’s not covered by your software engineering experience, while you upskill the rest of the team in their software engineering game.  

Other folks in this series have said SQL. Yes! True also for anyone who works with data as an analyst, scientist, engineer, etc. I’d love it if SQL was more of a business-wide skill, like Excel, but that is probably just wishful thinking 😉 

This week the interview is with Gavin Campbell.

What is data engineering for you and how does it overlap with software engineering?

Data engineering is all the stuff that happens to our data between the point where it is created and the point where it can be used for something valuable. In the past, this mostly involved using graphical ETL tools to extract data from the relational databases backing business applications or websites, transform it into a more or less dimensional schema, and load it into another database, commonly known as a data warehouse. The consumers of this data would then analyse this data using graphical reporting tools. Before the discovery of data engineering, this entire field of endeavour was known as business intelligence.

In recent years, the range of sources for potentially valuable data has expanded greatly to include data from website interactions, data from mobile devices, data only available from third-party APIs, and many more. The range of potential uses of this data has also expanded greatly, from pre-formatted reports to interactive dashboards, to consuming the outputs of data analysis in the applications that generate the data – in recommender systems, for example.

In parallel with this, there has been a creeping realisation that graphical tools don’t lend themselves to easy versioning, testing, packaging and deployment. Fortunately for the data engineers, the closely related field of software engineering solved most of these problems a very long time ago, provided that the code in question can be represented as text files. Accordingly, most modern data engineering involves writing code in a text editor, much like software engineering, and writing automated tests and deployment pipelines to deliver this code.

In short, data engineering is writing code that wrangles data using the practices that we know to deliver better outcomes in software delivery.

How did you get involved in data engineering?

I have spent most of my career skipping back and forward between software development, “data”, and the all-important “DevOps”. Having started as a fairly incompetent C/C++ programmer a very long time ago, I drifted into database administration at a time when database administration involved star-point screwdrivers and soldering irons. From there it was a natural progression into business intelligence and the realisation that the reason all this stuff was so unreliable was that nobody was writing any tests.

I expended a huge amount of energy trying to come up with satisfactory ways to write automated tests for these processes, ranging from SQL Server stored procedures to ETL tools, to graphical reporting tools, and now to Python notebooks.

I’m not sure that all of this effort has been 100% valuable, and I now think that there are some tools better thought of as end-user tools for which it isn’t worthwhile attempting to implement software delivery techniques. The irony, of course, is that tools like Tableau, PowerBI, and Qlik were all supposed to be end-user tools, yet the job listing websites are full of advertisements for Tableau, PowerBI, and Qlik developers.

This led to an increased amount of work helping data teams implement automated testing and deployment, during which I have dipped in and out of working with teams who work on actual websites that do useful things. I feel this has helped me understand what “good” looks like when working with data teams.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

I think the hardest thing for software engineers is that data engineering almost always involves some kind of external platform, such as PostgreSQL, or Apache Spark, or Snowflake. What this means is that the way you write your own code can have dramatic effects on the performance or other behaviour of the platform, sometimes in non-obvious ways. These platforms also change the way we think about testing, since if almost all your code is going to be executed by Apache Spark, writing a complex mock for Apache Spark itself may not be the most valuable activity.

There are also certain types of change that are very expensive to make – generally, those involving a lot of data movement – which is a consideration that doesn’t apply to deploying web apps, for example. These changes, when necessary, need to be identified ahead of time.

What data trends are you keeping an eye on?

I think the data tools space currently consists of a lot of vendors attempting to eat each other’s lunches by expanding the core capabilities of their tools – such as data modelling, or data transformation – into areas traditionally served by other tools. This has led to a version of the classic “one-stop-shop” vs “best of breed” decision for many projects.

There are also customers who have decided all of this stuff is a bit too difficult and are gravitating towards no-code or low-code solutions. Naturally, there are other vendors rushing to fill this space with graphical tools that suffer from all the same problems as their predecessors from the 1990s.

Do you have any recommendations for software engineers who want to be data engineers?

Software engineers already have most of the technical skills needed to be data engineers. Often the “a-ha” moment comes from finding a problem that is difficult to solve at scale by churning out Java or Scala or Python code and finding a solution using Spark or similar platforms. Also, in a data engineering team, there will be people with strong statistical backgrounds but very little experience in the tools for software delivery, so individuals with a software engineering background can make a significant difference to the success of these teams. 

This week the interview is with Will Faithfull.

What is data engineering for you and how does it overlap with software engineering?

Personally, I don’t see them as particularly separate disciplines, although the area of focus differs in a few ways. Firstly, it’s expected now that a good software engineer has a decent understanding of the domain in which they work, and I see data engineering as an extension of this principle. 

Secondly, you’re still building, testing, packaging and deploying applications at the end of the day; it’s the nature of the applications and the domain you’re working with that differs. Probably the biggest change is that the code you’re writing isn’t (necessarily) concerned with handling requests! I’d say it’s the perfect switch for a software engineer who feels a bit compartmentalised and would like to have a chance to be involved in everything technical.

How did you get involved in data engineering?

You might ask how I got involved in software engineering in the first place – I was doing a PhD, teaching at university, and tired of being destitute on a lab demonstrator salary, so I started my own company. 

I was working in software engineering and tech leadership, and didn’t really do any data engineering until 2020. It was actually a silver lining moment. In March 2020, at the first onset of the COVID pandemic, the project I was working on as a software engineer was shelved in the midst of the crisis. I was offered the opportunity to make a sideways move into data engineering and I was enthusiastic about such a move not because of the circumstances, but because I always had a lingering interest in data and data science, and a sense that I didn’t often get to use the skills I learned during my PhD.

I now split my time between data engineering and data science, so I’ve been able to carve out the exact role I wanted.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

When I made the transition to data engineering, I was a bit in the dark about exactly what data engineering entailed, but my academic background gave me the confidence that I would be able to fill in the gaps. As it turns out, I think the inverse was true: it was actually software engineering fundamentals that I leant on – testing, application packaging, deployments, infrastructure, SQL, Linux and bash skills. 95% of it is the same principles, but the details of the applications you’re building and deploying differ.

That’s not to say that there aren’t skills and specialisms within data engineering that are less common to the average software engineer. Advantages in data disciplines include:

  • Infrastructure familiarity and infrastructure as code. I’d say that for data problems, applications tend to be shallower and the infrastructure deeper than equivalent software engineering work.
  • Breadth as opposed to depth of knowledge. In data applications there tend to be fewer cookie-cutter solutions and you have the opportunity to solve things in creative ways if you have the perspective.
  • Advanced SQL. When you’re working with huge volumes of data it is often more efficient to solve problems in SQL than in application code where possible.
  • Familiarity with storage paradigms, memory layouts, distributed computing frameworks and multiprocessing/parallelisation. Whether it’s optimisation or a bug, you will probably run into some problem that involves these sooner rather than later.
  • Familiarity with how data scientists approach problems and want to use data.
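As a small illustration of the “Advanced SQL” point, the sketch below computes the same per-customer totals twice: once by pulling every row into application code, and once by letting the database do the aggregation. It uses Python’s built-in `sqlite3` module purely as a stand-in for a real data platform, and the `orders` table and its contents are invented for the example.

```python
import sqlite3

# In-memory database with a tiny, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 7.5), ("bob", 2.5), ("carol", 4.0)],
)


def totals_in_application(connection):
    """Fetch every row and aggregate in Python: all data crosses the wire."""
    totals = {}
    for customer, amount in connection.execute("SELECT customer, amount FROM orders"):
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def totals_in_sql(connection):
    """Let the database aggregate: only one row per customer is returned."""
    query = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    return dict(connection.execute(query))


app_totals = totals_in_application(conn)
sql_totals = totals_in_sql(conn)
```

Both functions return the same answer here. The difference only shows up at volume, where the first version has to move every row out of the database before doing any work.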

What data trends are you keeping an eye on?

I think graph databases are extremely underutilised but people have caught on to the potential. There’s also a trend towards incremental processing and point-in-time analysis. Both of these things relate to fundamentally the same point – that we have a tendency to flatten and aggregate data, whether that is over time (depth), or associations (breadth). But the tools exist now to make sense of subgraphs and connections in the data without having to lose any information by flattening it, even at a large scale. The tools have some maturing to do, but the capability is there.

Do you have any recommendations for software engineers who want to be data engineers?

Don’t be afraid to make the jump. If you have an interest or curiosity in the data domain and analytics side of things, even if you know next to nothing about them already, that will stand you in great stead. 95% of the skills will be second nature to you, so you only have to focus on that remaining 5%. You really have a chance to carve out the exact role you want under the auspices of the “Data Engineer” title.

This week the interview is with David Pardoe.

What is data engineering for you and how does it overlap with software engineering?

I have worked in the domain of data science and artificial intelligence for over 25 years. Data has been the driving force for every piece of work I have ever done. Manipulating data is a necessity for any kind of data science work. Most of the work, however, is typically done in an “offline” sense – the work is done at a snapshot in time. This means that the data engineering work can be done as it is required, and there is no real concept of “productionising” the work. This is where my experiences of data engineering and software engineering diverge. Data engineering in the data science space is about exploring something, whereas software engineering is carried out in order to build a working piece of software. This means that some of the critical aspects of best practice software engineering, such as continuous integration and continuous delivery, are not considered when carrying out data engineering for data science.

How did you get involved in data engineering?

Even from the early days of my career I was aware of how important it was for the data engineering work I did to be repeatable, either for similar projects or to repeat the same work (i.e. “productionising” it). To this end, I always endeavoured to fully comment the code I wrote and structure it well. In addition, many data science projects I have done have created datasets that served many other purposes and, therefore, needed to be delivered in a robust, repeatable way. This has meant I have drifted towards embracing software engineering best practices.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

It is absolutely key to have an in-depth understanding of data, and not just the code that manipulates it. I have retained a single textbook from my university days: SSADM (Structured Systems Analysis and Design Method). This book details how you represent a business and its core business processes in data terms. It discusses data entities and how they are related to each other through primary key – foreign key relationships. Even with the emergence of newer ways of storing and accessing data, this knowledge is the fundamental building block for how to manipulate data. Of course, at the heart of this is understanding SQL. I was using SQL from day one of my very first job, and that has been the foundation for so much of what I have ended up doing. If a data engineer is given some datasets with little information (especially not a data dictionary!), they should be able to analyse them in order to understand them and answer fundamental questions like: “What does each row actually represent?” and “How do I uniquely identify each row?” All too often I have seen software engineers join datasets and end up duplicating data and not knowing why.
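The two questions above can be checked mechanically. The following is a minimal Python sketch, with invented `customers` and `addresses` datasets and hypothetical helper names, showing how to test whether a set of columns uniquely identifies each row, and how a join on a non-unique key quietly duplicates data.

```python
def is_unique_key(rows, key_columns):
    """Return True if the given columns uniquely identify every row."""
    keys = [tuple(row[column] for column in key_columns) for row in rows]
    return len(keys) == len(set(keys))


def inner_join(left, right, column):
    """Naive inner join of two lists of dicts on a shared column."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[column] == right_row[column]
    ]


# Invented sample data: one row per customer, one row per customer address.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Raj"},
]
addresses = [
    {"customer_id": 1, "city": "Leeds"},
    {"customer_id": 1, "city": "York"},
    {"customer_id": 2, "city": "Bath"},
]

customers_keyed_by_id = is_unique_key(customers, ["customer_id"])
addresses_keyed_by_id = is_unique_key(addresses, ["customer_id"])
addresses_keyed_by_id_and_city = is_unique_key(addresses, ["customer_id", "city"])

# Joining on customer_id alone repeats customer 1 once per address.
joined = inner_join(customers, addresses, "customer_id")
customer_rows_after_join = len(joined)
```

In this made-up example a row in `addresses` represents one address of one customer, not one customer, so counting customers from the joined result would overstate the total.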

What data trends are you keeping an eye on?

In my view, streaming ETL is perhaps the most significant development in data engineering right now. This is a necessity to ensure data is made available in the most usable structure and format, and as up-to-date as possible. Historically, much transformation of data was done on a batch basis and transformed data was therefore only available for a snapshot in time that is immediately out-of-date. This is perfect for analytical and reporting purposes, but not for responding to data operating in real time.

Do you have any recommendations for software engineers who want to be data engineers?

It is essential to understand how data represents business processes. Consider the business process and think about the data that is created (or changed) at each step of the process. This translates to getting a thorough understanding of relational data concepts and SQL (not that you will necessarily be coding in it). This includes such things as knowing how to handle many-to-many relationships between entities, and knowing what inner and outer joins are. It is also important to understand how to transform data from one entity level to another. This is more than just aggregating, but also things like flagging if a ‘parent’ record has at least one ‘child’ record of a particular type. It will also help if you learn the specifics of how to apply software engineering best practices in the context of data engineering. Develop the skills to be able to test data engineering code. This is primarily done by querying or visualising the data before and after.
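As a rough sketch of moving from one entity level to another, the example below flags each ‘parent’ account that has at least one ‘child’ transaction of a particular type, then checks the result by comparing row counts before and after. It uses Python’s built-in `sqlite3` module, and the `accounts` and `transactions` tables are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY);
    CREATE TABLE transactions (account_id INTEGER, kind TEXT);
    INSERT INTO accounts VALUES (1), (2), (3);
    INSERT INTO transactions VALUES
        (1, 'refund'), (1, 'purchase'), (1, 'refund'), (2, 'purchase');
    """
)

# EXISTS keeps the result at account level: one row per account, however
# many matching transactions there are. A plain join would repeat account 1.
FLAG_QUERY = """
    SELECT a.account_id,
           EXISTS (
               SELECT 1
               FROM transactions t
               WHERE t.account_id = a.account_id
                 AND t.kind = 'refund'
           ) AS has_refund
    FROM accounts a
    ORDER BY a.account_id
"""

flags = conn.execute(FLAG_QUERY).fetchall()

# A simple before/after test: the transformation must not add or drop accounts.
accounts_before = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
accounts_after = len(flags)
```

The before/after comparison is the simplest form of the testing described above: query the input, query the output, and assert that the entity level has been preserved.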

This week the interview is with Lewis Crawford.

What is data engineering for you and how does it overlap with software engineering?

I think the evocative term here is engineering – precise, reliable, secure and planned, as opposed to ‘hacking’ – and that is something data and software have in common. Most data engineering has an aspect of software engineering for transforming the data into some more useful state, and most software has the ability to store, retrieve and modify data. The key difference is the scale of the data involved.

For me, the analogy to engineering in data engineering is best exemplified by a bridge. It is easy to visualise the roles of architects, structural engineers and builders that allow traffic to safely and securely move from one domain to another. Data engineers provide this bridge for data flows.

How did you get involved in data engineering?

I started with distributed computing for my MSc, processing satellite images to create drainage networks. There was always more data than would fit on a single ‘computer’. I used PVM condor for scheduling parallel processing jobs, which were usually stuck in long queues behind physics and engineering simulations. The benefits of unit testing and code review are never more apparent than when you wait two days to run your job, only to find a simple spelling mistake and be put at the back of the queue again. I ended up in roles in various companies that all had a distributed compute element, so it was natural that I gravitated to ‘big data’ processing around about 2009. Along the way, I picked up a lot of experience around patterns and architectures, lineage and governance, data types and storage formats, in addition to the old problem of orchestrating multiple computers to perform a single task.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

I would say concepts rather than skills, and an appreciation of scale. Understanding why choosing the right storage format or partitioning strategy can significantly improve the ability to perform analytics at scale. Understanding that processing data comes with a responsibility not just for ensuring quality and consistency but also for custodianship, data security, retention and governance. You don’t have to be a GDPR expert, but you do need to know that someone on the team has to ask the questions – what data are we using, for what purpose, for how long, where did it come from, and who is the end-user?

At some point, scale catches everyone out. Realising that your program that takes 1 second to process 1,000 records is going to take more than 11 days to process 1bn records – and that is if you don’t hit all kinds of limits you didn’t know were there (temp space, logging directories, etc.).
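The arithmetic behind that estimate is worth spelling out. This short Python sketch assumes throughput stays perfectly constant as the data grows, which is optimistic; the function name is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimated_days(total_records, records_per_second):
    """Estimate run time in days, assuming constant throughput."""
    seconds = total_records / records_per_second
    return seconds / SECONDS_PER_DAY


# 1,000 records per second applied to 1bn records is 1,000,000 seconds.
days = estimated_days(total_records=1_000_000_000, records_per_second=1_000)
```

One million seconds works out at roughly eleven and a half days on a single process, before any of the hidden limits come into play.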

What data trends are you keeping an eye on?

I am fascinated by machine learning and the incredible advances in AI generally. So I tend to focus on the distributed architectures that support this in the data preparation stage, such as Ray and Dask, as well as obvious platforms such as Spark. I am also interested in accelerated compute using GPUs for both deep learning and transfer learning, and in frameworks like rapids.ai, which enable existing workloads to transfer to GPU with minimal code change.

Do you have any recommendations for software engineers who want to be data engineers?

Go for it! But it may be worth reading/chatting about/trying out some of these concepts – metadata, governance, lineage, OLTP / OLAP architecture, partitioning, columnar, eventual consistency, quality, and most of all SCALE!

We’ve worked with many customers to help them create microservice architectures. One pattern we’ve observed is that microservices sometimes take on a data engineering capability. When this happens, there’s a hidden data product, waiting to come to the fore.

With an established team already responsible for collecting and understanding the data, it can be a great starting point for an organisation new to data-driven decision making, and greatly empower capabilities such as Machine Learning.

What is a data product?

Just like a microservice, a data product is a domain-bounded, isolated product capability that provides value to its users. Unlike a microservice, the users of a data product interact with it in an ad-hoc manner, and there’s no specific set of user interactions.

A data product may be:

  • Surfaced through a business intelligence tool, for user-generated reporting.
  • Combined by data scientists with other data products, to enable Machine Learning.
  • Brought into an operational data store, for real-time usage by a microservice.
  • Leveraged by data engineers in a data pipeline to create new data products.

By extracting your data product, the owning team of the microservice can take responsibility for it and ensure it has the data engineering practices, data quality guarantees, and product ownership it needs. If a data product remains concealed within a microservice, users will struggle to leverage the data in a way that empowers the organisation.

Signs you’ve got a hidden data product

The following are some weak indicators of a hidden data product:

  • Users keep asking for more reports.
  • Users make direct requests for the data.
  • You can no longer process all the data in the application.

It’s a strong indicator if your microservice data is being used in a data pipeline. If it’s not being treated as a data product, it is probably fragile, lacking productionisation, and its development is being driven by microservice needs and not by other users of the data. This can look like:

  • Exports or replications of the data.
  • Users ask for more data sources to be added to enrich the data.
  • You spend most application development time handling data transformation.

One of these signals may not be cause enough for splitting out a data product, but a combination of them builds a very strong case! 

Example: Fraud case working application

A fintech company needs a way to investigate, and protect against, potentially fraudulent use of its platform. To do so, they have created a case working tool for their fraud analysts that provides a summarised view across incoming transactions for quick assessment.

This data transformation logic takes place within the microservice, and is sourced from the raw unstructured data store.

The application naturally grew in size over time and several of our data product indicators are now present: 

  • Data transformations inside the application constitute most of the engineering work, with users frequently asking for it to be extended.
  • Data scientists see the RDS store of summarised result data as very valuable, and easier to work with than the raw semi-structured data, and a replica has been shared.
  • The RDS replica is now also used as the basis for a machine learning pipeline.

These signals all point to the data inside the case working application as being exceptionally valuable, and worthy of being a data product.

How to surface your data product 

Start out by building a data pipeline for your data product, assuming it doesn’t exist already. We’ve got a Data Pipeline Playbook that can help you with this.

Your team may need to bring in some data engineering expertise, but the developers who own the relevant microservice will be able to pair with them, and promote a cross-discipline approach to implementing your new data product. 

Aim for a steel thread implementation, and try surfacing it directly to analytical users for feedback. After that, start to integrate it back into your microservice, and replace the microservice code. This is an example of the popular Strangler Fig pattern for incremental design. 

It’s important to remember that your microservice is no longer the sole driver of this data product, and that it may need to do its own transformation of the data, so as not to disrupt other consumers of your new data product.

Example: Fraud case working application

Back to the project team at our fintech. Now aware of the value of the summarised data in the case working app, they have taken the steps of separating it into its own data pipeline, running on the AWS EMR/Spark stack.

This has provided multiple benefits:

  • The footprint of the application is significantly reduced.
  • The data product is now leveraged not only by data scientists, but also other users such as performance analysts, giving them a cleaner data set to work with.
  • Other data products have begun to leverage the summarised data set beyond initial plans, in this case a growing product for analytical insights and fraud alerts.
  • Data enrichment, data quality, validation tests, and versioning management can all be managed around the data product itself.
  • They can now process much more data and extend its historical accuracy by moving it to a purpose-built tool in Spark.

A lot of inspiration and central concepts for this blog come from the Data Mesh Architecture proposed by Zhamak Dehghani.

We wish your new Data Product the best of luck on its journey!

This week the interview is with me. 🙂

What is data engineering for you and how does it overlap with software engineering?

To make my opinion clear, I’m going to describe the goal of data in organisations so I can move into the specifics of data engineering.

Organisations have multiple systems generating data: transactional data, usage data, etc. This data is valuable because it could and should be used to support decision making within an organisation. Sometimes this decision making is a manual process of looking at dashboards or reports generated by data analysts; at other times it’s achieved by machine learning algorithms, created by data scientists, that make the decisions by themselves.

Making this data available and usable is the main work of a data engineer. From a high-level perspective, the work consists of collecting, modelling, cleaning and transforming the data into datasets or data streams ready to be consumed. It sounds simple but, as you can imagine, there are a lot of idiosyncrasies, such as data security, schema changes, very big datasets, data quality, etc.

So, how do data engineers solve their tasks? Most of the time it’s by doing software engineering.

That being said, why do we have two different roles? Because the data area is so broad and it requires specific knowledge and a specific mindset to work with (the product is the data, the data needs to be tested apart from the software, etc).

I tend to see data engineering as a layer of skills and knowledge on top of software engineering skills. Also, I believe some software engineers have been working on data engineering tasks without being called ‘data engineers.’

How did you get involved in data engineering?

I started to work on data during my master’s thesis (and the following research), which was focused on natural language processing for Portuguese. I know it sounds more like data science, and it was. After this research phase, I worked for a couple of years doing software engineering and I ended up in a project where the client asked my team to build an ML model to predict user behaviour. I ended up working on the data science part and also developing an ML model pipeline and an infrastructure for A/B testing models. These days, we tend to call this last part ML engineering, although I see it as part of data engineering. Since this project, I’ve been working on data projects, mostly on data pipelines.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

Starting with some generic topics: 

  • Strong SQL skills are a must-have.
  • Knowing the details of different types of data storage (transactional databases, data warehouses, distributed file systems) is also very important.
  • Data modeling is an important part of the job. 
  • It’s good to know how data scientists and data analysts work because they’re usually the clients of the data.
  • GDPR and data security speak for themselves.

Stream processing and big data processing are also very good skills to have because a good share of projects require them. It’s also good to know the data landscape, which is huge and has no preferred stack that is consistently and widely used.

What data trends are you keeping an eye on?

I’m closely following the trend of leveraging the power of modern cloud data warehouses that separate storage from compute, like BigQuery or Snowflake, to build SQL data pipelines in an ELT fashion. It’s a game-changer, in my opinion. The need to use tools like Spark and to have specialised engineers and infrastructure is minimised. Having the data pipelines in SQL allows data analysts to participate in the transformation part, and there is a new role emerging based on that: analytics engineering.

I’m keeping an eye on data mesh, with a focus on treating data as a product and having a central team that facilitates the work on data for other teams.

I’m also interested in AutoML. As mentioned, I also have experience as a data scientist; not much, but sufficient to believe that some of that work can be automated. I do believe that AutoML can help data scientists, but I don’t believe it can replace them.

Do you have any recommendations for software engineers who want to be data engineers?

It should not be hard if you already have software engineering skills. The field is broad, so you might want to choose one area that you would like to work in, data pipelines for instance, and start to study it from a practical perspective. If you are into books I recommend Designing Data-Intensive Applications as a general data book. Also, if you are part of a project where you can pair on data engineering tasks, give it a try.

After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.

Based on this, I decided to create a new blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering and it might encourage more software engineers to become data engineers.

This week the data engineer I’m interviewing is Scott Cutts.

What is data engineering for you and how does it overlap with software engineering?

I think data and software engineering are two sides of the same coin, and that is the business domain. Whereas software is about direct user and system requests and using domain-driven design to service these needs, data engineering is about providing insights, models and data in the same domain-driven way, to service user and system data needs. The key separation is that data engineering is always after the fact, on existing datasets that are used to create other datasets, insights or models. You also often don’t have complete control over how your users interact with your product, as they can query your datasets how they want, and will often use it in unexpected ways!

How did you get involved in data engineering?

It was an accident! I was on a fintech project as a software engineer, which required us to work closely with the client’s data scientists, provide real-time insights from external sources, and produce aggregated counts that fed into their predictive models. This was all Scala/REST/Kafka work so I thought I was doing software engineering – really this was already data engineering, a good example of the blurred line between the two.

A group of us were asked if we could help improve the ETLs the data scientists were using to train their models, and spike out Spark compared to their existing Hadoop platform. We used Behaviour Driven Development and TDD to drive out our ETLs, just like software, and we got to work directly with data across the organisation. This was the best bit – with data transformation it feels like doing the best domain parts of software but even richer and with more understanding of the business.

What are the skills a data engineer must have that a software engineer usually doesn’t have?

I’d say 95% of skills are transferable, but some that you would need to improve are:

  • DevOps and infrastructure as code. Most software engineers have experience in that, but with data you can expect more, both in the data platform infrastructure itself, plus sourcing/serving data from/to a variety of interfaces.
  • Multitool awareness and adaptability – there are a LOT of ways to solve data engineering problems out in the world. Not only are there different cloud providers with their own stacks, but every organisation is likely using them in very different ways to solve their problems with different architectures. Data engineering is still a while away from the kind of standardisation seen in software engineering tech stacks.
  • SQL – you will need to get better at it, even if it’s just for verifying your own data or doing analysis/debugging and not necessarily coding in it.
  • Optimisation – data is big, and will usually be slow and require several iterations of optimisation to work faster and cheaper.

What data trends are you keeping an eye on?

The Data Mesh architecture is the big one at the moment, as it looks to spell the end of the data monoliths (warehouses & lakes), and move towards the digital platform and domain-driven solutions that work so well in software. The tooling is fast developing to match this, with abstracted, TDD compliant frameworks like dbt that let data teams focus on the solution, and less on the underlying infrastructure.

MLOps I find very interesting as well, and providing an automated, dependable way to deploy models at the organisation level that are trusted by users is a great challenge. This part of the industry is moving really fast and best practices are only recently being encoded.

Do you have any recommendations for software engineers who want to be data engineers?

Most “data projects” have a mix of software and data engineering. Find one of those projects, enter it as software and start pairing with the data engineers. Your software skills will be useful immediately, and you’ll soon learn the rest.

 


This week the data engineer is Paul Brabban.

What is data engineering for you and how does it overlap with software engineering?

I’m not convinced data engineering as a specialism is really much more than a product of immature and inadequate tooling. A decade ago, we were largely stuck with vertically scalable, on-prem data warehouses that operated very much like a traditional relational database, albeit perhaps with column-oriented storage. Compute and storage limited how much you could really do there. In that world, specialised Data Engineers were needed to persuade tools like Airflow to move data to and around your warehouse, or to fiddle with indexes and configuration in the warehouse to make the necessary queries feasible.

Today, we have “modern data warehouses” like BigQuery and Snowflake that can directly consume a far greater variety of data than before, and once landed in the warehouse you can use “SQL pipelines” to express new concepts as views – with much more direct visibility of costs and much simpler setup. Some of these new warehouses even integrate things like machine learning to really reduce the need for data engineering skills and help folks like analysts, data scientists and decision makers really get things done.

I think there’s a big overlap with software engineering. I’ll leave someone else to speak on practices, and say that in the real world, very few software engineers can avoid data! Every team I’ve worked with has had some need to move data around – right now, it’s a team running a search system for a client’s website and apps. Somehow the data ends up in the search index – although these folks aren’t “data engineers” and what they’ve built doesn’t use “data engineering tools” it’s a data pipeline and no mistake. I’d argue that some data engineering experience could have smoothed that journey and made good use of existing tools, saving writing some code, but the job got done and that’s what matters!

In a nutshell – I see data engineering as helping others make data available and effectively use the data they need. I was about to point to specialisms like optimising larger datasets for performance but then I remember that software engineers do that too!

How did you get involved in data engineering?

I was a software developer for about a decade. I am still a software developer – I’ve just finished writing an application to rotate credentials for a data system – using all the linting, testing and code organisation skills I’ve gained over the years. I found myself drawn to the challenges of dealing with larger datasets, making slow things fast and the allure of the incredible insights that data can be hiding. I guess I’ve also always been drawn to evidence-based methods, too. Perhaps that explains why I spend so much time wrestling with AB tests!

What are the skills a data engineer must have that a software engineer usually doesn’t have?

Lack of knowledge of the tooling and ecosystem is one thing – it’s very easy to build your own thing unless you know a little about the terminology data folks use. I think experience would be up there – as a software engineer you’ll likely spend most of your time writing software and only occasionally tackle a data problem. The last thing that springs to mind is…SQL. Nowadays, it’s quite possible for a software engineer to go for long periods of time without coming into contact with SQL – but SQL is the foundation, and with SQL pipelines becoming more of a thing it’s only going to become more important to read and write it.

What data trends are you keeping an eye on?

I’m watching and prodding the Data Mesh approach with interest. I think it’s a really promising approach to solve two crucial problems with traditional data engineering – skills shortage and centralisation. There’s a lot in there, but organising into data products and decentralising responsibility seem like a sensible idea – particularly as it’s basically what we’ve already seen work with “normal” products. As I said, there is not so much difference between data and non-data engineering!

Do you have any recommendations for software engineers who want to be data engineers?

There’s a good chance you are already doing data engineering. Have a think about what you’ve done in the recent past, see if you’ve been involved with moving data around. I bet you have. Have a look around for blog posts and the like for how others approached those problems, you might get some ideas and jumping-off points into how data engineering techniques might have saved you time or improved the product.

 

 

As a follow-up from Language Agnostic Data Pipelines, the following post is focused on the use of dbt (data build tool).

Dbt is a command-line tool that enables us to transform the data inside a data warehouse by writing SQL select statements, which represent the models. There is also a paid version with a web interface, dbt Cloud, but for this article let’s consider just the command-line tool.

This article is not intended as a tutorial about dbt (one already exists), nor as one about TDD. The goal is to illustrate how one of our software development practices, test-driven development, can be used to develop dbt models.

Testing strategies in dbt

Dbt has two types of tests:

  • Schema tests: applied in YAML, they return the number of records that do not pass an assertion. When this number is 0, all records pass and therefore your test passes.
  • Data tests: specific queries that must return 0 records for the test to pass.

Both tests can be used against staging/production data to detect data quality issues.

The second type of test gives us more freedom to write data quality tests. These tests run against a data warehouse loaded with data. They can run on production, on staging, or for instance against a test environment where a sample of data was loaded. These tests can be tied to a data pipeline so they can continuously test the ingested and transformed data.
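The "zero rows returned means the test passes" rule can be sketched outside dbt. The snippet below is only an illustration in Python with an in-memory SQLite database; the `orders` table, the rule and the `run_data_test` helper are hypothetical names chosen for the example, not part of dbt.

```python
import sqlite3


def run_data_test(conn, test_query):
    """Return the rows selected by the test query; an empty list means the test passes."""
    return conn.execute(test_query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders (order_id integer, amount real);
    insert into orders values (1, 10.0), (2, -5.0), (3, 7.5);
""")

# Hypothetical data quality rule: no order may have a negative amount.
# The query selects the offending records, in the same spirit as a dbt data test.
negative_amounts = "select order_id, amount from orders where amount < 0"

failures = run_data_test(conn, negative_amounts)
status = "PASS" if not failures else f"FAIL ({len(failures)} rows)"
print(status, failures)
```

Because the query returns the offending records themselves, a failure immediately points at the data that needs attention.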

Using dbt data tests to compare model results

With a little bit of SQL creativity, the data tests (SQL selects) can be naively* used to test model transformations, comparing the result of a model with a set of expectations:

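-- Rows we expect the model to produce
with expectations as (
   select 'value' as column1
   union all
   select 'value 2' as column1
)

-- Any row returned here is an expectation missing from the model
select * from expectations
except
select * from analytics.a_model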

The query returns rows when an expected row is missing from the model, and in that case dbt reports a test failure. However, this approach isn’t an effective way to test the models, for the following reasons:

  • The test input is shared among all the tests (this could be overcome by executing dbt test and the data setup for each test, although it’s not practical due to the lack of clarity and the maintainability of test suites).
  • The test input is not located inside the test itself, so it’s not user friendly to code nor easy to understand the goal of each test.
  • The dbt test output doesn’t show the differences between the expectations and the actual values, which slows down the development.
  • For each test, we need to have a boilerplate query with the previous format (with expectations as…).

Considering these drawbacks, it doesn’t seem like the right tool for model transformation tests.

A strategy to introduce a kind of ‘data unit tests’

It’s possible and common to combine SQL with the templating engine Jinja (https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros). It’s also possible to define macros which can be used to extend dbt’s functionality. That being said, let’s introduce the following macro:

unit_test(table_name, input, expectations)

The macro receives:

  • A table name (or a view name).
  • An input value that contains a set of inserts.
  • A table of expectations.

To illustrate the usage of the macro, here is our last test case refactored:

{# The model under test #}
{% set table_name = ref('a_model') %}

{# Input data, set up in plain SQL #}
{% set input %}
insert into a_table(column1) values ('value'), ('value 2');
{% endset %}

{# Rows the model is expected to return #}
{% set expectations %}
select 'value' as column1
union all
select 'value 2' as column1
{% endset %}

{{ unit_test(table_name, input, expectations) }}

There is some boilerplate when using Jinja to declare the variables needed to call the unit test macro. However, it seems a fair trade-off, because this strategy enables us to:

  • Simplify the test query boilerplate.
  • Set up input data in plain SQL and in the same file.
  • Set up expectations in plain SQL and in the same file.
  • Run each test segregated from other tests.
  • Show differences when a test fails.
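To make the benefits listed above more concrete, here is a rough Python and SQLite analogue of the flow. It is a sketch under assumptions, not the macro’s actual implementation: the `unit_test` function below, the `MODEL_SQL` transformation (an upper-casing select) and the sample rows are all hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical model under test: upper-cases column1 of a_table.
MODEL_SQL = "select upper(column1) as column1 from a_table"

# Input data lives next to the test, in plain SQL.
INPUT_SQL = """
    create table a_table (column1 text);
    insert into a_table(column1) values ('value'), ('value 2');
"""


def unit_test(model_sql, input_sql, expectations_sql):
    """Run one isolated test and return (missing, unexpected) rows."""
    conn = sqlite3.connect(":memory:")  # a fresh database per test keeps tests segregated
    conn.executescript(input_sql)
    actual = Counter(conn.execute(model_sql).fetchall())
    expected = Counter(conn.execute(expectations_sql).fetchall())
    conn.close()
    # Counter subtraction keeps duplicates, so repeated rows are compared too.
    missing = sorted((expected - actual).elements())
    unexpected = sorted((actual - expected).elements())
    return missing, unexpected


passing = unit_test(
    MODEL_SQL, INPUT_SQL,
    "select 'VALUE' as column1 union all select 'VALUE 2' as column1",
)
failing = unit_test(
    MODEL_SQL, INPUT_SQL,
    "select 'VALUE' as column1 union all select 'VALUE 3' as column1",
)
print("passing:", passing)
print("failing:", failing)
```

The failing case reports both the expected row that is missing and the actual row that was not expected, which is the kind of difference output that speeds up a TDD loop.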

To illustrate the usage of this approach, here is a demo video:



The previous macro will be available in the repo published with the Language Agnostic Data Pipelines.

*Naively, because using EXCEPT between the two tables fails to detect duplicate rows. It could be fixed easily, but for illustrative purposes we preferred to keep the example as simple as possible.
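The duplicate-row limitation mentioned in this note can be reproduced with a small sketch. It uses Python with an in-memory SQLite database rather than a real warehouse, and the count-based comparison is just one possible fix, shown as an assumption rather than the approach used in the published macro.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table expectations (column1 text);
    create table a_model (column1 text);
    insert into expectations values ('value'), ('value 2');
    -- The model output wrongly contains 'value' twice.
    insert into a_model values ('value'), ('value'), ('value 2');
""")

# Naive comparison: EXCEPT works on distinct rows, so the duplicate goes unnoticed.
naive = conn.execute("""
    select * from expectations
    except
    select * from a_model
""").fetchall()

# Count-aware comparison: compare (row, occurrences) pairs in both directions.
count_aware = conn.execute("""
    with e as (select column1, count(*) as n from expectations group by column1),
         a as (select column1, count(*) as n from a_model group by column1)
    select 'expected' as side, column1, n
      from (select column1, n from e except select column1, n from a)
    union all
    select 'actual' as side, column1, n
      from (select column1, n from a except select column1, n from e)
""").fetchall()

print("naive:", naive)
print("count-aware:", sorted(count_aware))
```

The naive query returns no rows, so the test would pass, while the count-aware query reports that 'value' was expected once but appears twice.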

Bringing software engineering practices to the data world

It is also easy to apply other standard software development practices in dbt, such as integration with a CI/CD environment. This is one of the advantages of using it over transforming data inside ETL tools which use a visual programming approach.

Wrapping up, we advocate that data-oriented projects should always use well-known software engineering best practices. We hope that this article shows how you can apply TDD using the emerging dbt data transformation tool.

Pedro Sousa paired on this journey with me. He is taking the journey from software engineering to data engineering in our current project, and he helped with the blog post.

Contact us!

For more information on data pipelines in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of data pipelines with you, get in touch using the form below.

Based on the experience shared in evolving a client’s data architecture, we decided to share a reference implementation of data pipelines. Let’s start by recalling a definition from the data pipeline playbook.

What is a Data Pipeline? 

From the EE Data Pipeline playbook:

A Data Pipeline is created for data analytics purposes and has:

  • Data sources – these can be internal or external and may be structured (e.g. the result of a database call), semi-structured (e.g. a CSV file or a Google Sheet), or unstructured (e.g. text documents or images).
  • Ingestion process – the means by which data is moved from the source into the pipeline (e.g. API call, secure file transfer).
  • Transformations – in most cases data needs to be transformed from the input format of the raw data, to the one in which it is stored. There may be several transformations in a pipeline.
  • Data Quality/Cleansing – data is checked for quality at various points in the pipeline. Data quality will typically include at least validation of data types and format, as well as conforming against master data. 
  • Enrichment – data items may be enriched by adding additional fields such as reference data.
  • Storage – data is stored at various points in the pipeline. Usually at least the landing zone and a structured store such as a data warehouse.

Functional requirements

  • Pipelines that are:
    • Easy to orchestrate
    • Support scheduling 
    • Support backfilling
    • Support testing on all the steps
    • Easy to integrate with custom APIs as sources of data
    • Easy to integrate in a CI/CD environment
  • The code can be developed in multiple languages to fit each client’s skill set when Python is not a first-class citizen.

Our strategy 

In some situations a tool like Matillion, Stitchdata or Fivetran can be the best approach, although it’s not the best choice for all of our clients’ use cases. These ETL tools work well when using their existing pre-made connectors, but when the majority of the data integrations need custom connectors, they are certainly not the best approach. Apart from the known cost, there is also an extra cost when using these kinds of tools: the effort to make the data pipelines work in a CI/CD environment. Also, at Equal Experts, we advocate testing each step of the pipeline and, if possible, developing the steps using test-driven development, which is near impossible in these cases.

That being said, for the cases where an ETL tool won’t fit our needs, we identified the need for a reference implementation that we can use for different clients. Since the skill set of each team is different, and Python is sometimes not among those skills, we decided not to use the well-known Python tools that are used these days for data pipelines, like Apache Airflow or Dagster.

So we designed a solution using Argo Workflows as the orchestrator. We wanted something which allowed us to define the data pipelines as DAGs like Airflow. 

Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo represents workflows as DAGs (Directed Acyclic Graphs), and each step of the workflow is a container. Since data pipelines can easily be modelled as workflows, it is a great tool to use. We also have the freedom to choose which programming language to use for the connectors or the transformations; the only requirement is that each step of the pipeline is containerised.
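To illustrate what modelling a pipeline as a DAG means, here is a small Python sketch. It does not use Argo; it only relies on the standard library’s graphlib to show how dependencies determine which steps can run in parallel. The step names are hypothetical and loosely based on an ingest, transform and publish flow.

```python
from graphlib import TopologicalSorter

# Each hypothetical step maps to the set of steps it depends on.
# In Argo, every one of these steps would be a container.
pipeline = {
    "ingest_covid_api": set(),
    "ingest_reference_data": set(),
    "load_staging": {"ingest_covid_api", "ingest_reference_data"},
    "dbt_run": {"load_staging"},
    "dbt_test": {"dbt_run"},
    "refresh_dashboard": {"dbt_test"},
}

sorter = TopologicalSorter(pipeline)
sorter.prepare()  # raises CycleError if the graph is not acyclic

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
    batches.append(ready)
    sorter.done(*ready)

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

Steps in the same batch have no dependencies on each other, so an orchestrator is free to run them in parallel, which is the case for the two ingestion steps here.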

For the data transformations, we found that dbt was our best choice. Dbt performs the transformations needed between the staging tables and the analytics tables. Dbt is SQL-centric, so there isn’t a need to learn another language. Dbt also has features that we wanted, like testing and documentation generation, and has native connections to the Snowflake, BigQuery, Redshift and Postgres data warehouses.

These two tools are how we ended up with a language-agnostic data pipeline architecture that can be easily reused and adapted in multiple cases and for different clients.

Reference implementation

Because we value knowledge sharing, we have created a public reference implementation of this architecture in a GitHub repo, which shows a pipeline for a simple use case: ingesting UK COVID-19 data (https://api.coronavirus.data.gov.uk).

The goal of the project is to have a simple implementation that can be used as an accelerator to other teams. It can be easily adapted to make other data pipelines, to integrate in a CI/CD environment, or to extend the approach and make it work for different scenarios. 

The sample project uses a local Kubernetes cluster to deploy Argo and the containers which represent the data pipeline. It also includes a database where the COVID-19 data is loaded and transformed, and an instance of Metabase to show the data in a friendly dashboard.

We’re planning to add infrastructure as code to the reference implementation, to deploy the project on AWS and GCP. We might also work on aspects like making it easier to monitor the data pipelines when deployed in a cloud, or using Great Expectations.

Transparency is at the heart of our values

We value knowledge sharing and collaboration, so we hope that this article, along with the data pipelines playbook will help you to start creating data pipelines in whichever language you choose. 


 

In the mid-2010s there was a step change in the rate at which businesses started to focus on gaining valuable insights from data.

As the years have passed, the importance of data management has started to sink in throughout the industry. Organisations have realised that you can build the best models, but if your data isn’t of good quality, your results will be wrong.

There are many, varied job roles within the data space, and I always thought the distinctions between the roles were pretty obvious. However, recently a lot has been written about the differences between the data roles, and more specifically the difference between Data Scientists and Data Engineers.

I think it’s important to understand that not knowing these differences can be instrumental in teams failing or underperforming with data. That is why I am writing this article: to attempt to clarify the roles, what they mean, and how they fit together. I hope that this will help you to understand the differences between a Data Scientist and a Data Engineer within your organisation.

What do the Data Engineer and Data Scientist roles involve?

So let’s start with the basics. Data Engineers make data available to the business, and Data Scientists enable decisions to be made with the data. 

Data Engineers, at a senior level, design and implement services that enable the business to gain access to its data. They do this by building systems that automagically ingest, transform and publish data, whilst gathering relevant metadata (lineage, quality, category, etc.), enabling the right data to be utilised.  

Data Scientists not only utilise the data made available, but also uncover additional data that can be combined and processed to solve business problems.  

Both Data Scientists and Data Engineers apply similar approaches to their work.  They identify a problem, they look for the best solution, then they implement the solution. The key difference is the problems they look at and, depending on their experience, the approach taken to solving it.  

Data Engineers like Software Engineers, or even more generally engineers, tend to use a process of initial development, refinement and automation.  

Initial development, refinement and automation explained, with cars.

In 1908 Henry Ford released the Model T Ford. As you can see, it has many of the same features as a modern car – wheels on each corner, a bonnet, a roof, seats, a steering wheel, brakes, gears.  

 

In 1959 the first Mini was released.  It had all the same features as the Model T Ford. However, it was more comfortable, cheaper, easier to drive, easier to maintain, and more powerful. It also incorporated new features like windscreen wipers, a radio, indicators, rear view mirrors. Basically, the car had, over 50 years, been incrementally improved.  

Step forward in time to the 2010s, and Tesla released the Models S and X. These too have many features we can see in the Model T Ford and the Mini. But now they also contain some monumental changes.

The internal combustion engine is replaced with electric drive. They have sat-nav, autopilot, and even infotainment. All of which combine to make the cars much easier and more pleasurable to drive.

What we are seeing is the evolution of the car from the initial production line – basic but functional – through multiple improvements in technology, safety, economy, driver and passenger comforts. All of which improve the driving experience.  

In other words we are seeing initial development, refinement and automation. A process that Data Engineers and Data Scientists know only too well.

For Data Engineers the focus is on data: getting it from source systems to targets, ensuring the data quality is verified, the lineage captured, the attributes tagged, and the access controlled.

What about Data Scientists?  They absolutely follow the same pattern, but they additionally look to develop analytics along the Descriptive, Diagnostic, Predictive, Prescriptive scale.  

So why is there confusion between the Data Scientist and Data Engineer roles?  

There is of course not a single answer but some of the common reasons include:

  • At the start, both Data Scientists and Data Engineers spend a lot of time Data Wrangling. This means trying to get the data into a shape where it can be used to deliver business benefits.
  • At first, the teams are often small and they always work very closely together. In fact, in very small organisations they may be the same person, so it’s easy to see where the confusion might come from.
  • It’s often given to Data Engineers to “productionise” analytics models created by Data Scientists.
  • Many Data Engineers and Data Scientists dabble in each other’s areas, as there are many skills both roles need to employ. These can include data wrangling, automation and algorithms.

As the seniority of data roles develops, so do the differences.

When I talk to and work with Data Engineers and Data Scientists, I can often place them in one of three categories – Junior, Seasoned, Principal – and when I work with Principals, in either space, I can tell they are a world apart in their respective fields.

So what differentiates the different levels and roles?

That’s it. I hope this article helps you to more easily understand the differences between a Data Scientist and a Data Engineer. I also hope this helps you to more easily identify both within your organisation.  If you’d like to learn more about our Data Practice at Equal Experts, please get in touch using the form below.

 

Knowing, understanding and managing your data throughout its lifecycle is more important than it has ever been. And more difficult. 

Of course, the never-ending growth in data volume is partly responsible for this, as are the countless processes that need to be applied to the data to ensure it is usable and effective. This is why data analysts and data engineers turn to data pipelining.

Complexity is added when, in order to keep abreast of the latest requirements, organisations need to constantly deploy new data technologies alongside legacy infrastructure.

All three of these elements mean that, inevitably, data pipelines are becoming more complicated as they grow. In the final article in our data pipeline series, we have highlighted some of the common pitfalls that we have learned from our experience over the years and how to avoid them. These are also part of our Data Pipeline Playbook.

About this series

This is the final post in our six-part series on the data pipeline, taken from our latest playbook. Now we look at the many pitfalls you can encounter in a data pipeline project. Earlier in the series, we looked at what a data pipeline is and who it is used by. Next we looked at the six main benefits of a good data pipeline, part three considered the ‘must have’ key principles of data pipeline projects, and parts four and five covered the essential practices of a data pipeline. So here’s our list of some of the pitfalls we’ve experienced when building data pipelines in partnership with various clients. We’d encourage you to avoid the scenarios listed below.

Avoid tightly coupling your analytics pipelines with other business processes

Analytics data pipelines provide data to produce insights about your customers, business operations, technology performance, and more. For example, the role of a data warehouse is to create an historical record of data that can be mined for insights.

It is tempting to see these rich data sources as the best source of data for all data processing and plumb key business activities in these repositories. However, this can easily end up preventing the extraction of insights it was implemented for. Data warehouses can become so integrated into business operations – effectively acting as the Operational Data Store (ODS) – that they can no longer function as a data warehouse. Key business activities end up dependent on the fast processing of data drawn from the data warehouse, which prevents other users from running queries on the data they need for their analyses.

Modern architectures utilise a micro-service architecture, and we advocate this digital platform approach to delivering IT functionality (see our Digital Platform Playbook). Micro-services should own their own data, and there is unlikely to be a one-size-fits-all solution to the volumes, latencies, or use of master or reference data of the many critical business data flows implemented as micro-services. Great care should therefore be taken over which part of the analytics data pipelines their data is drawn from. The nearer the data they use is to the end users, the more constrained your data analytics pipeline will become over time, and the more restricted analytics users will become in what they can do.

If a micro-service is using a whole pipeline as part of its critical functionality, it is probably time to reproduce the pipeline as a micro-service in its own right, as the needs of the analytics users and the micro-service will diverge over time.

Include data users early on

We are sometimes asked if we can implement data pipelines without bothering data users. They are often very busy interfacing at senior levels, and as their work provides key inputs to critical business activities and decisions, it can be tempting to reduce the burden on them and think that you already understand their needs.

In our experience this is nearly always a mistake. As with any software development, understanding user needs as early as you can, and validating that understanding throughout development, is much more likely to lead to a valued product. Data users almost always welcome a chance to talk about what data they want, what form they want it in, and how they want to access it. When it becomes available, they may well need some coaching on how to access it.

Keep unstructured raw inputs separate from processed data

In pipelines where the raw data is unstructured (e.g. documents or images), and the initial stages of the pipeline extract data from it, such as entities (names, dates, phone numbers, etc.), or categorisations, it can be tempting to keep the raw data together with the extracted information. This is usually a mistake. Unstructured data is always of a much higher volume, and keeping it together with extracted data will almost certainly lead to difficulties in processing or searching the useful, structured data later on. Keep the unstructured data in separate storage (e.g., different buckets), and store links to it instead.
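As a minimal sketch of storing links rather than raw content, the structured record below carries only the extracted fields plus a URI pointing at the raw file. The bucket name, path layout and field names are illustrative assumptions, not part of any particular pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedDocument:
    """Structured record kept in the searchable store."""
    document_id: str
    names: tuple
    dates: tuple
    raw_uri: str  # link to the raw file, never the file content itself


def build_record(document_id, raw_bucket, names, dates):
    # Hypothetical layout: raw files live in their own bucket under documents/.
    raw_uri = f"s3://{raw_bucket}/documents/{document_id}.pdf"
    return ExtractedDocument(document_id, tuple(names), tuple(dates), raw_uri)


record = build_record(
    "doc-001", "example-raw-documents", ["Ada Lovelace"], ["1843-01-01"]
)
print(record.raw_uri)
```

Searches and joins then run over small structured records, and the raw file is fetched through the link only when somebody actually needs it.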

We hope that this article, along with all the others in the series, will help you create better pipelines and address the common challenges that can occur when building and using them. Data pipeline projects can be challenging and complicated, but done correctly they securely gather information and allow you to make valuable decisions quickly and effectively. 

Contact us!

For more information on data pipelines in general, take a look at our Data Pipeline Playbook.  And if you’d like us to share our experience of data pipelines with you, get in touch using the form below.

Managing the flow of information from a source to the destination system forms an integral part of every enterprise looking to generate value from their data.

Data and analytics are critical to business operations, so it’s important to engineer and deploy strong and maintainable data pipelines by following some essential practices.

This means there’s never been a better time to be a data engineer. According to DICE’s 2020 Tech Job Report, Data Engineer was the fastest-growing job in 2019, growing by 50% YoY. Data Scientist was also high on the list, growing by 32% YoY.

But the parameters of the job are changing. Engineers now provide guidance on data strategy and pipeline optimisation and, as the sources and types of data become more complicated, engineers must know the latest practices to ensure increased profitability and growth. 

In our data pipeline playbook we have identified eleven practices to follow when creating a data pipeline. We touched on six of these practices in our last blog post. Now we talk about the other five, including iteratively creating your data models as well as observing the pipeline.  Applying these practices will allow you to integrate new data sources faster at a higher quality.

About this series

This is part five in our six-part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. In part three we considered the “must have” key principles of data pipeline projects, and in part four we looked at the six key practices needed for a data pipeline. Now we go into the details of more of those practices, before finishing off our series in part six with a look at the many pitfalls you can encounter in a data pipeline project. 

Practice Seven: Observe the pipeline

Data sources can suddenly stop functioning for many reasons – unexpected changes to the format of the input data, an unanticipated rotation of secrets or change to access rights, or a failure in the middle of the pipeline that drops the data. This should be expected, and a means of observing the health of data flows should be implemented. Monitoring the data flows through the pipelines will help detect when failures have occurred and limit their adverse impacts. Useful tactics to apply include:

  • Measuring counts or other statistics of data going in and coming out at various points in the pipeline.
  • Implementing thresholds or anomaly detection on data volumes and alarms when they are triggered.
  • Viewing log graphs – use the shapes to tell you when data volumes have dropped unexpectedly.
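The first two tactics can be sketched in a few lines. This is an illustrative example only: the stage names, the counts and the 10% threshold are assumptions, and a real pipeline would send the alerts to its monitoring tool rather than print them.

```python
def check_row_counts(stage_counts, max_drop_ratio=0.1):
    """Compare the row count of each stage with the stage before it.

    Returns a list of alert messages for stages that lost more than
    max_drop_ratio of the rows they received.
    """
    alerts = []
    stages = list(stage_counts.items())
    for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
        if prev_count == 0:
            alerts.append(f"{prev_name}: no rows received")
            continue
        drop = (prev_count - count) / prev_count
        if drop > max_drop_ratio:
            alerts.append(f"{name}: lost {drop:.0%} of rows from {prev_name}")
    return alerts


# Counts measured at three points in a hypothetical pipeline run.
counts = {"ingested": 1000, "validated": 990, "loaded": 400}
alerts = check_row_counts(counts)
for alert in alerts:
    print(alert)
```

A fixed ratio is the simplest possible threshold. Where volumes vary by day of the week or season, comparing against a rolling baseline is usually a better fit.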

Practice Eight: Data models are important and should be addressed iteratively

For data to be valuable to the end users (BI teams or data scientists), it has to be understandable at the point of use. In addition, analytics will almost always require the ability to merge data from sources. In our experience, many organisations do not suffer from big data as much as complex data – with many sources reporting similar or linked data – and a key challenge is to conform the data as a step before merging and aggregating it.

All these challenges require a shared understanding of data entities and fields – and need some kind of data model to resolve to.  If you ignore this data model at the start of the pipeline, you will have to address these needs later on.

However, we do not recommend the development of an enterprise data model before data can be ingested into the system. Rather, starting with the needs of the data users in the initial use cases will lead you to a useful data model that can be iterated and developed over time.

Practice Nine: Apply master data/reference data pragmatically to support merging

Most pipelines require data to be conformed not just to the schema but also against known entities such as organisational units, product lists, currencies, people, companies, and so forth. Ignoring this master data on ingestion will make it harder to merge data later on. However, master data management often becomes overwhelming and starts to seem as if the whole enterprise needs modelling. To avoid data analysis paralysis, we recommend starting from the initial use cases and iteratively building reference data and master data into the pipelines as they are needed.
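As a small illustration of starting pragmatically, the sketch below conforms one field (currency) against a tiny reference table built for a first use case. The reference entries and record fields are assumptions made for the example. Values that cannot be matched are set aside rather than silently dropped, so the reference data can grow as new cases appear.

```python
# Hypothetical reference data: maps the spellings seen so far to one code.
CURRENCY_REFERENCE = {
    "gbp": "GBP",
    "pound sterling": "GBP",
    "£": "GBP",
    "usd": "USD",
    "us dollar": "USD",
    "$": "USD",
}


def conform_currency(raw_value):
    """Return the reference code for a raw currency value, or None."""
    return CURRENCY_REFERENCE.get(raw_value.strip().lower())


def conform_records(records):
    """Split records into conformed rows and rows needing attention."""
    conformed, rejected = [], []
    for record in records:
        code = conform_currency(record["currency"])
        if code is None:
            rejected.append(record)
        else:
            conformed.append({**record, "currency": code})
    return conformed, rejected


records = [
    {"amount": 10, "currency": " GBP "},
    {"amount": 5, "currency": "US Dollar"},
    {"amount": 1, "currency": "doubloon"},
]
conformed, rejected = conform_records(records)
print(conformed)
print(rejected)
```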

Practice Ten: Use orchestration and workflow tools

Pipelines typically support complex data flows composed of several tasks. For all but the simplest pipelines, it is good practice to separate the dataflow from the code for the individual tasks. There are many tools that support this separation – usually in the form of Directed Acyclic Graphs (DAGs). In addition to supporting a clear isolate and reuse approach, and enabling continuous development through providing version control of the data flow, DAGs usually have a simple means of showing the data dependencies in a clear form, which is often useful in identifying bugs and optimising flows.
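To make the separation concrete, here is a minimal sketch using only the Python standard library (`graphlib`, available from Python 3.9). The task functions hold the code for the individual steps, while the dependency mapping describes the data flow. The task names and data are invented for illustration, and real orchestration tools add scheduling, retries and monitoring on top of this idea.

```python
from graphlib import TopologicalSorter


# Code for the individual tasks: each one reads and writes shared state.
def ingest(state):
    state["raw"] = [" 3", "1 ", "2"]


def clean(state):
    state["clean"] = [int(value.strip()) for value in state["raw"]]


def aggregate(state):
    state["total"] = sum(state["clean"])


TASKS = {"ingest": ingest, "clean": clean, "aggregate": aggregate}

# The data flow, kept apart from the task code: task -> tasks it depends on.
DEPENDENCIES = {"clean": {"ingest"}, "aggregate": {"clean"}}


def run(tasks, dependencies):
    """Run every task once, in an order that respects the dependencies."""
    state = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](state)
    return order, state


order, state = run(TASKS, DEPENDENCIES)
print(order, state["total"])
```

Because the flow is plain data, it can be version controlled, reviewed and drawn as a graph independently of the task code.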

Depending on the environment and the nature and purpose of the pipeline, some tools we have found useful are:

  •   Apache Airflow
  •   dbt
  •   Argo Workflows
  •   DVC
  •   Dagster
  •   AWS Glue

Practice Eleven: Continuous testing

As with any continuous delivery development, a data pipeline needs to be continuously tested. However, data pipelines do face additional challenges such as:

  • There are typically many more dependencies such as databases, data stores and data transfers from external sources, all of which make pipelines more fragile than application software – the pipes can break in many places. Many of these dependencies are complex in themselves and difficult to mock out.
  • Even individual stages of a data pipeline can take a long time to process – anything with big data may well take hours to run. Feedback time and iteration cycles can be substantially longer.
  • In pipelines with Personally Identifiable Information (PII), PII data will only be available in the production environment. So how do you do your tests in development? You can use sample data which is PII-clean for development purposes. However, this will miss errors caused by unexpected data that is not in the development dataset, so you will also need to test within production environments – which can feel uncomfortable for many continuous delivery practitioners.
  • In a big data environment, it will not be possible to test everything – volumes of data can be so large that you cannot expect to test against all of it.

We have used a variety of testing practices to overcome these challenges:

  • The extensive use of integration tests – providing mock-ups of critical interfaces or using smaller-scale databases with known data to give quick feedback on schemas, dependencies and data validation.
  • Implementing ‘development’ pipelines in the production environment with isolated ‘development’ clusters and namespaces. This brings testing to the production data, avoiding both PII issues and the need for sophisticated data replication/emulation across environments.
  • Statistics-based testing against sampled production data for shorter feedback loops on data quality checks.
  • Using infrastructure-as-code testing tools to test whether critical resources are in place and correct (see https://www.equalexperts.com/blog/our-thinking/testing-infrastructure-as-code-3-lessons-learnt/ for a discussion of some existing tools).
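The first of these practices, testing against a smaller-scale database with known data, might look like the sketch below. It uses an in-memory SQLite database as a stand-in for the real data store. The table names and the transformation under test are assumptions made for the example.

```python
import sqlite3


def load_daily_totals(conn):
    """Pipeline step under test: rebuild daily totals from the orders table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_totals (day TEXT PRIMARY KEY, total REAL)"
    )
    conn.execute("DELETE FROM daily_totals")
    conn.execute(
        "INSERT INTO daily_totals SELECT day, SUM(amount) FROM orders GROUP BY day"
    )


def test_daily_totals():
    # Small database with known data, so the expected output is known too.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (day TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.5)],
    )
    load_daily_totals(conn)
    rows = conn.execute(
        "SELECT day, total FROM daily_totals ORDER BY day"
    ).fetchall()
    assert rows == [("2024-01-01", 15.0), ("2024-01-02", 7.5)]
    return rows


print(test_daily_totals())
```

A test like this runs in milliseconds and checks the schema, the dependency on the orders table and the aggregation logic together, which is the quick feedback that a full-volume run cannot give.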

Hopefully this gives a clearer overview of some of the essential practices needed to create an effective data pipeline. In the next blog post in this series, we finish our series by looking at the many pitfalls you can encounter in a data pipeline project. Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  


A carefully managed data pipeline can provide you with seamless access to reliable and well-structured datasets.

A generalised form of transferring data from a source system A to a target system B, data pipelines are developed in small pieces, and integrated with data, logic and algorithms to perform complex transformations. To do this effectively, there are some essential practices that need to be adhered to.

In our data pipeline playbook we have identified eleven practices to follow when creating a data pipeline.  Here we touch on six of these practices such as how to start by using a steel thread, and in our next blog post we will talk about iteratively creating your data models as well as observing the pipeline.  Applying these practices will allow you to integrate new data sources faster at a higher quality as outlined in our recent post on the benefits of a data pipeline.

About this series

This is part four in our six-part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. In part three we considered the ‘must have’ key principles of data pipeline projects. Now we look at the six key practices needed for a data pipeline. Before we get into the details we just want to cover off what’s coming in the rest of the series. In part five we look at more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project. 

The growing need for good data engineering

Today, data engineers serve a wider audience than they did just a few years ago. As organisations increasingly need to apply machine learning techniques to their data, data engineers face new challenges in order to remain relevant. Essential to every project is the ability to reliably deliver large-volume data sets so that data scientists can train more accurate models.

Aside from dealing with larger data volumes, these pipelines need to be flexible in order to accommodate the variety of data and the increasingly high processing velocity required. The following practices are those that we feel are essential to successful projects, the minimum requirement for success. They are based on our collective knowledge and experience gained across many data pipeline engagements.  

Practice 1: Build for the right latency

When designing the pipeline, it’s important to consider what level of latency you need. What is your speed of decision? How quickly do you need the data? Building and running a low latency, real-time data pipeline will be significantly more expensive, so make sure that you know you need one before embarking on that path. You should also ask how fast your pipeline can be. Is it even possible for you to have a real-time data pipeline? If all your data sources are produced by daily batch jobs, then the best latency you can reach will be daily updates, and the extra cost of real-time implementations will not provide any business benefits.

If you do need to be within real-time or near real-time, then this needs to be a key factor at each step of the pipeline. The speed of the pipe is conditioned by the speed of the slowest stage.
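A back-of-the-envelope check can help with this question. The sketch below assumes a pipeline whose output needs all of its sources, so it cannot refresh more often than its slowest source delivers. The source names and intervals are invented for illustration.

```python
from datetime import timedelta


def best_update_interval(source_intervals):
    """The output cannot refresh more often than its slowest source."""
    return max(source_intervals.values())


def meets_requirement(source_intervals, required):
    """True if the sources could, in principle, support the required latency."""
    return best_update_interval(source_intervals) <= required


# Hypothetical sources: one nightly batch export and one event stream.
sources = {
    "orders_batch": timedelta(days=1),
    "clickstream": timedelta(seconds=5),
}

best = best_update_interval(sources)
real_time_possible = meets_requirement(sources, timedelta(minutes=1))
print(best, real_time_possible)
```

Here the nightly batch export sets the pace, so paying for a streaming implementation of the rest of the pipeline would not make the combined output any fresher.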

And be careful not to confuse the need for a real-time decision engine with the need for a real-time historical data store, such as a data warehouse for the data scientists. Decision models are created from stores of historical data and need to be validated before deployment into production. Model release usually takes place at a slower cadence (e.g., weekly or monthly). Of course, the deployed model will need to work on a live data stream, but we consider this part of the application development. This is not the appropriate use for a data warehouse or similar.

Practice 2: Keep raw data

Ingestions should start by storing raw data in the pipeline without making any changes. In most environments, data storage is cheap, and it is common to have all the ingested data persisted and unchanged. Typically, this is done via cloud file storage (S3, GCP Cloud Storage, Azure Storage), or HDFS for on-premise data.

Keeping this data allows you to reprocess it without re-ingestion if any business rule changes, and it also retains the possibility of new pipelines based on this data if, for example, a new dashboard is needed.
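A minimal sketch of landing raw data unchanged is shown below. It writes to a local directory as a stand-in for cloud file storage, and the source name, date layout and file naming are assumptions for the example. The payload is stored byte for byte, and the content hash in the file name means that landing the same payload twice does not create a second copy.

```python
import hashlib
import tempfile
from pathlib import Path


def land_raw(payload, landing_root, source, ingestion_date):
    """Persist the payload exactly as received and return where it was stored."""
    digest = hashlib.sha256(payload).hexdigest()
    target = landing_root / source / ingestion_date / f"{digest}.raw"
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.write_bytes(payload)  # no parsing, no cleaning, no reformatting
    return target


payload = b'{"id": 1,  "amount": "10.0"}'
with tempfile.TemporaryDirectory() as tmp:
    path = land_raw(payload, Path(tmp), "orders_api", "2024-01-01")
    stored = path.read_bytes()
    same_path = land_raw(payload, Path(tmp), "orders_api", "2024-01-01") == path

print(stored == payload, same_path)
```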

Practice 3: Break transformations into small tasks

Pipelines are usually composed of several transformations of the data, activities such as format validation, conformance against master data, enrichment, imputation of missing values, etc. Data pipelines are no different from other software and should thus follow modern software development practices of breaking down software units into small reproducible tasks. Each task should target a single output and be deterministic and idempotent. If we run a transformation on the same data multiple times, the results should always be the same.

By creating easily tested tasks, we increase the quality and confidence in the pipeline, as well as enhance the pipeline maintainability. If we need to add or change something on the transformation, we have the guarantee that if we rerun it, the only changes will be the ones we made.
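The sketch below illustrates a deterministic, idempotent task. The transformation is a pure function of its input, and the write step overwrites its target partition instead of appending, so rerunning the task leaves the output unchanged. The record fields and the dictionary used as a store are assumptions for the example.

```python
def normalise_orders(raw_orders):
    """Deterministic transformation: same input always gives the same output."""
    cleaned = {}
    for order in raw_orders:
        # Keying by id removes duplicates; later copies replace earlier ones.
        cleaned[order["id"]] = {
            "id": order["id"],
            "amount_pence": round(float(order["amount"]) * 100),
        }
    # Sorting makes the output independent of the input order.
    return [cleaned[key] for key in sorted(cleaned)]


def write_output(store, partition, rows):
    """Idempotent write: overwrite the partition, never append to it."""
    store[partition] = rows


raw = [
    {"id": 2, "amount": "5.50"},
    {"id": 1, "amount": "10"},
    {"id": 2, "amount": "5.50"},
]
store = {}
write_output(store, "2024-01-01", normalise_orders(raw))
first_run = dict(store)
write_output(store, "2024-01-01", normalise_orders(raw))  # rerun the task
print(store == first_run, store["2024-01-01"])
```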

Practice 4: Support backfilling

While pipelines are still immature at the start of development, it may not be possible to fully evaluate whether they are working correctly or not. Is this metric unusual because this is what always happens on Mondays, or is it a fault in the pipeline? We may well find at a later date that some of the ingested data was incorrect. Imagine you find out that, for one month, a source was reporting incorrect results, but for the rest of the time, the data was correct.

We should engineer our pipelines so that we can correct them as our understanding of the dataflows matures. We should be able to backfill the stored data when we have identified a problem in the source or at some point in the pipeline, and ideally, it should be possible to backfill just for the corresponding period of time, leaving the data for other periods untouched.
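One way to support this is to partition the processed data by day and rebuild only the affected partitions from the raw data that was kept. The sketch below assumes daily partitions held in dictionaries and uses a trivial transformation (`sum`) purely for illustration.

```python
from datetime import date, timedelta


def daterange(start, end):
    """Yield every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def backfill(store, raw_by_day, transform, start, end):
    """Rebuild the partitions between start and end; leave the rest untouched."""
    rebuilt = []
    for day in daterange(start, end):
        key = day.isoformat()
        if key in raw_by_day:
            store[key] = transform(raw_by_day[key])
            rebuilt.append(key)
    return rebuilt


raw_by_day = {"2024-01-01": [1, 2], "2024-01-02": [3, 4], "2024-01-03": [5]}
# The partition for 2 January was produced by a faulty run.
store = {"2024-01-01": 3, "2024-01-02": -1, "2024-01-03": 5}

rebuilt = backfill(store, raw_by_day, sum, date(2024, 1, 2), date(2024, 1, 2))
print(rebuilt, store)
```

This only works if the raw data has been kept and the transformation is idempotent, which is why the earlier practices matter here.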

Practice 5: Start with a steel thread

When starting at a greenfield site, we typically build up data pipelines iteratively around a steel thread – a first, thin data pipe that forms a narrow slice through the whole architecture. Building iteratively in this way progressively validates the quality and security of the data. The first thread creates an initial point of value – probably a single data source, with some limited processing, stored where it can be accessed by at least one data user. The purpose of this first thread is to provide an initial path to data and uncover unexpected blockers, so it is selected for simplicity rather than having the highest end-user value. Bear in mind that in the first iteration, you will need to:

  • Create a cloud environment which meets the organisation’s information security needs.
  • Set up the continuous development environment.
  • Create an appropriate test framework.
  • Model the data and create the first schemas in a structured data store.
  • Coach end users on how to access the data.
  • Implement simple monitoring of the pipeline.

Later iterations will bring in more data sources and provide access to wider groups of users, as well as bringing in more complex functionality such as:

  • Including sources of reference or master data.
  • Advanced monitoring and alerting.

Practice 6: Utilise cloud – define your pipelines with infrastructure-as-code

Pipelines are a mixture of infrastructure (e.g., hosting services, databases, etc.), processing code, and scripting/configuration. They can be implemented using proprietary and/or open-source technologies. However, all of the cloud providers have excellent cloud native services for defining, operating and monitoring data pipelines. They are usually superior in terms of their ability to scale with increasing volumes, simpler to configure and operate, and support a more agile approach to data architecture.

Whichever solution is adopted, since pipelines are a mixture of components, it is critical to adopt an infrastructure-as-code approach. Only by having the pipeline defined and built using tools such as Terraform, and source controlled in a repository, will pipeline owners have control over the pipeline and the confidence to rebuild and refine it as needed.

Hopefully this gives a clearer overview of some of the essential practices needed to create an effective data pipeline. In the next blog post in this series, we will outline more of the practices needed for data pipelines.  Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  


 

Facing an ever-growing set of new tools and technologies, high functioning analytics teams have come to rely increasingly on data engineers. Building and managing production data engineering pipelines is an inherently complex process, which can prove hard to scale without a systematic approach.

To help navigate this complexity, we have compiled our top advice for successful solutions. Here we examine some of the key guiding principles to help data engineers (of all experience levels) effectively build and manage data pipelines. These have been compiled using the experience of the data engineers at Equal Experts. They collectively recommend the adoption of these principles as they will help you lay the foundation to create sustainable and enduring pipelines.  

About this series

This is part three in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. Now we consider the “must have” key principles of data pipeline projects. Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part four, we look at the six key practices needed for a data pipeline. In part five we investigate more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project. 

The growing need for good data engineering

If I have learned anything from my years working as a data engineer, it is that practically every data pipeline fails at some point. There are many reasons: broken connections, broken dependencies, data arriving too late, or unreachable external systems or APIs. But, regardless of the cause, we can do a lot to mitigate the impact of a data pipeline’s failure. These ‘must have’ principles have been built up over the years to help ensure that projects are successful. They are based on my knowledge, and the Equal Experts team’s collective experience, gained across many data pipeline engagements.  

Data pipelines are products

Pipelines bring data from important business sources. In many cases, they feed reports and analyses that endure for a long time. Unless your business does not expect to alter how it operates, or there are no amendments to low-level processes, the data pipelines will always need to adapt to the changes in the fundamental processes, new IT, or the data itself.  As something that should respond to and embrace regular change, pipelines should be treated as products rather than projects.

This means that there should be multi-year funding to monitor and maintain the existing pipelines, providing headroom to add new ones and supporting the analysis and retirement of old ones. Pipelines need product managers to understand the pipelines’ current status and operability, and to prioritise the work. (See this Forbes article for a wider description of working in product-mode over project-mode.)

Find ways for making common use of the data

The data collected for a given problem or piece of analysis will nearly always be useful in answering other questions. When creating pipelines, we try to architect them in a way that allows reuse, whilst also remaining lean in our implementation choices.

In many cases there are simple ways of achieving this. For example, there are usually a variety of places where data is stored in the pipeline. Raw ingested data might be useful for unanticipated purposes. And it can often be made available to skilled users by providing them access to the landing zone.

Appropriate identity and access technologies, such as role-based access, can support reuse while permitting strict adherence to data-protection policies and regulations. The fundamental architecture can stay the same, with access being provided by adding or amending access roles and permissions to data buckets, databases or data warehouses.
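The idea can be sketched with a simple role-to-permission mapping. In practice this lives in your cloud provider's identity and access management service rather than in application code, and the roles and resource names below are invented for illustration.

```python
# Hypothetical roles mapped to the actions they may take on each resource.
ROLE_PERMISSIONS = {
    "data_engineer": {
        "landing-zone": {"read", "write"},
        "warehouse": {"read", "write"},
    },
    "analyst": {"warehouse": {"read"}},
}


def is_allowed(role, resource, action):
    """Check whether a role may perform an action on a resource."""
    return action in ROLE_PERMISSIONS.get(role, {}).get(resource, set())


def grant(role, resource, action):
    """Open up reuse by amending a role, without changing the architecture."""
    ROLE_PERMISSIONS.setdefault(role, {}).setdefault(resource, set()).add(action)


before = is_allowed("analyst", "landing-zone", "read")
grant("analyst", "landing-zone", "read")
after = is_allowed("analyst", "landing-zone", "read")
can_write = is_allowed("analyst", "landing-zone", "write")
print(before, after, can_write)
```

Giving skilled analysts read access to the landing zone is a change to one role, while the buckets, databases and pipelines themselves stay as they are.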

A pipeline should operate as a well-defined unit of work

Pipelines have a cadence driven by the need for decision-making and limited by the availability of source data. The developers and users of a pipeline should understand and recognise this as a well-defined unit of work – whether every few seconds, hourly, daily, monthly or event-driven.

Pipelines should be built around use cases

In general, we recommend building pipelines around the use case rather than the data source. This will help ensure that business value is achieved early. In some cases, the same data source might be important to several use cases, each with different cadences and access rights. Understanding when to reuse parts of pipelines and when to create new ones is an important consideration. For example, faster pipelines can always be used for slower cadences, but they typically require more effort to maintain and adapt. It might be easier to create a simple batch pipeline for a new slower-cadence use case that is not expected to change substantially than to upgrade a fast-streaming pipe to meet the new requirements. 

Continuously deliver your pipelines

We want to be able to amend our data pipelines in an agile fashion as the data environment and needs of the business change. So, just like any other piece of working software, continuous delivery practices should be adopted to enable continuous updates of data pipelines in production. Adopting this mindset and these practices is essential to support continuous improvement and create feedback loops that rapidly expose problems and address user feedback.

Consider how you name and partition your data

Data pipelines are a mix of code and infrastructure that can become confusing as they grow if care is not taken with the naming. Pipelines will include at least a set of databases, tables, attributes, buckets, roles, etc., and they should be named in a consistent way to facilitate understanding and maintenance of the pipelines, as well as make the data meaningful to the end-users.

In many architectures, naming will directly affect how your data is partitioned, which in turn affects the speed of the search and retrieval of data. Consider what will be the most frequent queries when specifying bucket names, table partitions, shards, and so on.
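As an illustration, the sketch below builds object keys from one naming convention, with the date as the partition. The convention itself (domain, dataset, then year, month and day) is an assumption chosen for the example. Because the most frequent query here is assumed to be "everything for one month", that query becomes a simple prefix match instead of a scan of every key.

```python
from datetime import date


def partition_key(domain, dataset, day, filename):
    """Build a key following one consistent naming convention."""
    return (
        f"{domain}/{dataset}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/{filename}"
    )


def keys_for_month(keys, domain, dataset, year, month):
    """Select a month of data by prefix, as an object store listing would."""
    prefix = f"{domain}/{dataset}/year={year:04d}/month={month:02d}/"
    return [key for key in keys if key.startswith(prefix)]


keys = [
    partition_key("sales", "orders", date(2024, 1, 31), "part-0.json"),
    partition_key("sales", "orders", date(2024, 2, 1), "part-0.json"),
    partition_key("sales", "orders", date(2024, 2, 15), "part-0.json"),
]
february = keys_for_month(keys, "sales", "orders", 2024, 2)
print(february)
```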

Want to know more?

These guiding principles have been born out of the experience of our engineers, each of whom has 10+ years of data engineering for end-to-end machine learning solutions. We are sure there are lots of other principles, so please do let us know of any approaches you have found effective in managing data pipelines. 

In our next blog post in this series we will start laying out some of the key practices of data pipelines.  Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  


 

The six main benefits of an effective data pipeline

When you think of the technology tools that power a successful business, a data pipeline isn’t always at the top of the list. Although most forward-thinking companies now realise data is one of their most valuable assets, the importance of data engineering is often underestimated. 

Yet modern data pipelines enable your business to quickly and efficiently unlock the data within your organisation. They allow you to extract information from its source, transform it into a usable form, and load it into your systems where you can use it to make insightful decisions. Do it well and you will benefit from faster innovation, higher quality (with improved reliability), reduced costs, and happy people. Do it badly, and you could lose a great deal of money, miss vital information or gain completely incorrect information.

In this article we look at how a successful data pipeline can help your organisation, as we attempt to unpack and understand the benefits of data pipelines.

About this series

This is part two in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Now we look at the six main benefits of an effective data pipeline. Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part three we consider the ‘must have’ key principles of data pipeline projects, parts four and five cover the essential practices of a data pipeline. Finally, in part six we look at the many pitfalls you can encounter in a data pipeline project. 

The benefits of a great data pipeline

Simply speaking, a data pipeline is a series of steps that move raw data from a source to a destination. In the context of business intelligence, a source could be a transactional database. The destination is where the data is analysed for business insights. In this journey from the source to the destination, transformation logic is applied to data to make it ready for analysis. There are many benefits to this process; here are our top six.

1 – Replicable patterns
Understanding data processing as a network of pipelines creates a way of thinking that sees individual pipes as examples of patterns in a wider architecture, which can be reused and repurposed for new data flows.

2 – Faster timeline for integrating new data sources
Having a shared understanding and tools for how data should flow through analytics systems makes it easier to plan for the ingestion of new data sources, and reduces the time and cost for their integration.

3 – Confidence in data quality

Thinking of your data flows as pipelines that need to be monitored and also be meaningful to end users improves the quality of the data and reduces the likelihood of breaks in the pipeline going undetected.

4 – Confidence in the security of the pipeline

Security is built in from the first pipeline by having repeatable patterns and a shared understanding of tools and architectures. Good security practices can be readily reused for new dataflows or data sources.

5 – Incremental build
Thinking about your dataflows as pipelines enables you to grow your dataflows incrementally. By starting with a small manageable slice from a data source to a user, you can start early and gain value quickly.

6 – Flexibility and agility
Pipelines provide a framework where you can respond flexibly to changes in the sources or your data users’ needs.
Designing extensible, modular, reusable Data Pipelines is a larger topic and very relevant in Data Engineering. In the next blog post in this series, we will outline the principles of data pipelines.  Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.  


 

It is common to hear that ‘data is the new oil,’ and whether you agree or not, there is certainly a lot of untapped value in much of the data that organisations hold.

Data is like oil in another way – it flows through pipelines. A data pipeline ensures the efficient flow of data from one location to the other. A good pipeline allows your organisation to integrate new data sources faster, provides patterns that you can replicate, gives you confidence in your data quality, and builds in security. But data flow can be precarious and, when not given the correct attention, it can quickly overwhelm your organisation. Data can leak, become corrupted, and hit bottlenecks and, as the complexity of the requirements grows and the number of data sources multiplies, these problems increase in scale and impact.

About this series

This is part one in our six part series on the data pipeline, taken from our latest playbook. Here we look at the very basics – what is a data pipeline and who is it used by? Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part two, we look at the six main benefits of a good data pipeline, part three considers the ‘must have’ key principles of data pipeline projects, and parts four and five cover the essential practices of a data pipeline. Finally, in part six we look at the many pitfalls you can encounter in a data pipeline project. 

Why is a data pipeline critical to your organisation?

There is a lot of untapped value in the data that your organisation holds. Data that is critical if you take data analysis seriously. Put to good use, data can identify valuable business insights on your customers and your operations. However, to find these insights, the data has to be regularly, or even continuously, transported from the place where it is generated to a place where it can be analysed.

A data pipeline consolidates data from all your disparate sources into one or more destinations, to enable quick data analysis. It also ensures consistent data quality, which is absolutely crucial for reliable business insights. 

So what is a data pipeline?

A data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. We like to think of this transportation as a pipeline because data goes in at one end and comes out at another location (or several others). The volume and speed of the data are limited by the type of pipe you are using and pipes can leak – meaning you can lose data if you don’t take care of them.

The data engineers who create a pipeline are a critical service for any organisation. They create the architectures that allow the data to flow to the data scientists and business intelligence teams, who generate the insight that leads to business value.

A data pipeline is created for data analytics purposes and has:

Data sources – These can be internal or external and may be structured (e.g., the result of a database call), semi-structured (e.g., a CSV file or a Google Sheets file), or unstructured (e.g., text documents or images).

Ingestion process – This is the means by which data is moved from the source into the pipeline (e.g., API call, secure file transfer).

Transformations – In most cases, data needs to be transformed from the input format of the raw data to the format in which it is stored. There may be several transformations in a pipeline.

Data quality/cleansing – Data is checked for quality at various points in the pipeline. Data quality will typically include at least validation of data types and format, as well as conforming with the master data.

Enrichment – Data items may be enriched by adding additional fields, such as reference data.

Storage – Data is stored at various points in the pipeline, usually at least the landing zone and a structured store (such as a data warehouse).

End users – These are covered in the next section.
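To make these components concrete, here is a minimal sketch in Python that runs a tiny pipeline end to end using only the standard library. Everything in it is an assumption made for illustration: the CSV content, the `COUNTRY_NAMES` reference data, the `orders` table and the function names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: a small CSV export (semi-structured data).
RAW_CSV = """order_id,country_code,amount
1,GB,10.50
2,PT,abc
3,GB,4.25
"""

# Hypothetical reference data used for validation and enrichment.
COUNTRY_NAMES = {"GB": "United Kingdom", "PT": "Portugal"}


def ingest(raw_text):
    """Ingestion: read raw rows from the source into the pipeline."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def is_valid(row):
    """Data quality: check types and conformance with the reference data."""
    try:
        float(row["amount"])
    except ValueError:
        return False
    return row["order_id"].isdigit() and row["country_code"] in COUNTRY_NAMES


def transform(row):
    """Transformation and enrichment: cast types, add the country name."""
    return (
        int(row["order_id"]),
        row["country_code"],
        COUNTRY_NAMES[row["country_code"]],
        float(row["amount"]),
    )


def run_pipeline(raw_text, connection):
    """Run every stage and store the clean rows in a structured store."""
    rows = ingest(raw_text)
    clean = [transform(row) for row in rows if is_valid(row)]
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, country_code TEXT, country_name TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", clean)
    connection.commit()
    return len(rows) - len(clean)  # number of rejected rows


connection = sqlite3.connect(":memory:")
rejected = run_pipeline(RAW_CSV, connection)
stored = connection.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"stored={stored} rejected={rejected}")
```

In this sketch the row with a non-numeric amount is rejected at the quality check rather than silently stored, which is one simple way to stop the pipe from "leaking" bad data downstream. A real pipeline would also record why each row was rejected.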

So, who uses a data pipeline?

We believe that, as in any software development project, a pipeline will only be successful if you understand the needs of the users. 

Not everyone uses data in the same way. For a data pipeline, the users are typically:

Business intelligence/management information analysts, who need data to create reports.

Data scientists, who need data to do an in-depth analysis of point problems or create algorithms for key business processes (we use ‘data scientist’ in the broadest sense, including credit risk analysts, website analytics experts, etc.).

Process owners, who need to monitor how their processes are performing and troubleshoot when there are problems.

Data users are skilled at visualising and telling stories with data, identifying patterns, or understanding significance in data. Often they have strong statistical or mathematical backgrounds. And, in most cases, they are accustomed to having data provided in a structured form – ideally denormalised – so that it is easy to understand the meaning of an individual row of data without the need to query separate tables or databases.
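As a small illustration of what ‘denormalised’ means here, the sketch below flattens two related tables into rows that can be read on their own. The `customers` and `orders` data and the `denormalise` helper are hypothetical examples, not taken from any particular system.

```python
# Hypothetical normalised tables: orders refer to customers by id.
customers = {
    101: {"name": "Alice", "segment": "Retail"},
    102: {"name": "Bob", "segment": "Wholesale"},
}
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 10.5},
    {"order_id": 2, "customer_id": 102, "amount": 99.0},
]


def denormalise(orders, customers):
    """Copy the customer fields onto each order so every row stands alone."""
    flat_rows = []
    for order in orders:
        customer = customers[order["customer_id"]]
        flat_rows.append({
            **order,
            "customer_name": customer["name"],
            "customer_segment": customer["segment"],
        })
    return flat_rows


flat_orders = denormalise(orders, customers)
print(flat_orders[0])
```

Each flattened row now carries the customer name and segment alongside the order, so an analyst can interpret it without looking up a second table.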

Is a data pipeline a platform?

Every organisation would benefit from a place where it can collect and analyse data from different parts of the business. Historically, this need has often been met by a data platform, a centralised data store where useful data is collected and made available to approved people.

But, whether they like it or not, most organisations are, in fact, a dynamic mesh of data connections which need to be continually maintained and updated. Following a single platform pattern often leads to a central data engineering team tasked with implementing data flows. 

The complexities of meeting everyone’s needs and ensuring appropriate information governance, as well as a lack of self-service, often make it hard to ingest new data sources. This can then lead to backlog buildup, frustrated data users, and frustrated data engineers. 

Thinking of these data flows as a pipeline changes the mindset away from monolithic solutions to a more decentralised way of thinking – understanding what pipes and data stores you need and implementing them the right way for that case, whilst reusing where appropriate.

Now that we understand a little more about the data pipeline, what it is and how it works, we can start to look at the benefits and assess whether they align with your digital strategy. We cover these in the next blog article, ‘What are the benefits of data pipelines?’

For more information on the data pipeline in general, take a look at our Data Pipeline Playbook. And if you’d like us to share our experience of the data pipeline with you, get in touch using the form below.