Unit testing in dbt – from an experiment to an open-source framework
In the following article, I want to share our journey to introduce the concept of unit testing in the framework dbt. There were a couple of existing efforts in the community but none as we envisioned – writing unit tests in SQL with a fast-feedback loop so that we can even use it for tdd.
Fellow colleague Pedro Sousa and I have published a couple of articles about our journey – we described our first experiment and shared our second and polished approach. After the blogs, a couple of teams at Equal Experts started to use our strategy and give us feedback.
As mentioned in one of the articles, we always thought dbt should have support for unit tests. We asked dbt’s team about the roadmap to support unit tests and we found it unlikely that it was going to happen. Also, they think it makes more sense as an external framework. Personally, it doesn’t make much sense to me, one can argue that if we look at programming languages, we are used to having the testing capabilities as external libraries, but dbt is not a programming language and it already supports other types of tests.
After a couple of conversations with the other teams, we were encouraged to use our work to create the dbt-unit-testing framework under the Equal Experts Github.
We released the framework three months ago, and since then we’ve started to have traction on our Github. Currently, we have 47 stars and 45 closed pull requests, and we have approximately 120 unique visitors per 14 days. The best outcome is having people collaborating with us, giving feedback, creating issues and developing pull requests. We already have four community contributors and we are proud to say that we appreciate all the work and the effort – @halvorlu, @darist, @charleslr and @gnilrets.
Community collaboration and feedback are crucial to improve the framework and prioritise what should be done. We have a couple of ideas in the backlog, such as adding support for more data sources, but we don’t yet have a clear roadmap. We prefer to listen to the feedback and work based on that. Continuous improvement through continuous user feedback perfectly describes our mindset.
This post shares our journey, mindset and appreciation for the open-source community engagement in such small projects.
You can check the framework here: https://github.com/EqualExperts/dbt-unit-testing
Contributing to tech communities is very much part of our mission at Equal Experts.
In my opinion, ‘data is the new oil’ is a metaphor that should be used with caution, especially by those who wish to portray data in a positive light.
That is, whilst there are many similarities between data and oil, most are unflattering. I believe that by confronting these negative connotations, we can have the right conversations about our responsibilities in the Age of Data, whilst finding better metaphors to describe them.
A Brief History
In the beginning was Clive Humby, the British data scientist, who coined the phrase ‘data is the new oil’ back in 2006. It has since become part of the business and management lexicon, repeated by journalists, policy makers and world leaders alike. In common usage, the metaphor emphasises the fact that oil and data are critical parts of the modern global economy, with the latter gradually replacing the former. Humby also recognised that data, like oil, has no intrinsic value, and expensive processes of refinement have to be applied before they become valuable.
Certainly, data powers much of the economy, just as oil powers our engines. Much of what we do online is part of a Faustian pact, in which we allow the tech giants to harvest our data in exchange for useful, free tools such as email. Tech evangelists minimise the costs whilst emphasising the benefits of data in our lives. But if we ever stopped to actually consider how much personal information we give away each day we’d put our laptops in the freezer. And comparing data to oil has a dark side. Oil is a dirty business. Oil-based products – petrol, plastics, chemicals – are harming the planet. Put simply, this isn’t the kind of company data should want to keep.
Oil Spills and Data Leaks
Let’s look at one of the most regrettable similarities between data and oil.
As oil moves around the globe, leaks happen (there have been 466 large oil spills in the last 50 years). Much has been said and written about these disasters and the environmental damage they cause. Coupled with the growing apprehension of the role oil plays in the global climate crisis, you might expect the demand for oil to be falling like a stone. You’d be wrong.
Figure 1: Global Oil Production 1999-2020
And if that graph surprises you, consider this:
Figure 2: Number of monthly active Facebook users worldwide
If you’re looking for indicators of decline following Facebook’s equivalent to Exxon Valdez – the Cambridge Analytica scandal – you won’t find it. Lest we forget, Cambridge Analytica harvested upwards of 87 million Facebook users’ personal data without their consent, then sold that data to political consultancies. This dubious practice may well have affected the outcome of the 2016 US Presidential election, and the Brexit vote in the UK the same year. But despite #deletefacebook and some social and political huffing at the time, the scandal didn’t make a dent on Facebook’s fortunes.
So tech giants and oil barons are alike, in that they leak and pollute and behave with disregard for the wider community, without much consequence.
Oil and Water
The question then becomes, is there a better metaphor out there? During my research I’ve happened across plausible arguments in favour of a cataclysmic comparison – that is to say, data is the new nuclear power (awesomely powerful, yet capable of dreadful contamination and destruction). When discussing this piece with a leading practitioner, he reminded me that data ‘flows’ from one place to another, and suggested that it’s like water (it’s nourishing and necessary – but needs filtering and processing to be safe; it can leak), or slightly less appetisingly, data is like blood.
All are decent metaphors (I particularly like the ‘water’ alternative). However, water (like uranium, or blood) is physical – if I buy and drink a litre of water, no one else can drink that same litre – whereas data can be used simultaneously in different places, at multiple times in multiple ways. And data is unique, whereas one glass of water is essentially the same as any other.
If we stick with data being like oil, we’re left with harrowing images of sick seabirds and bleached reefs. Which prompts me to ask: are we in danger of losing something valuable, by tarring data with the same oily brush?
Data for Good
Last year, academics at the University of Oxford interrogated a massive dataset to assess the effectiveness of a range of potential treatments for Covid-19. Using advanced data science techniques, they discovered an unexpected pattern – namely, a drug used in the treatment of rheumatoid arthritis could save lives, reduce the need for a ventilator, and shorten patients’ stay in hospital. Such a breakthrough should be seen as an unalloyed success story for all those involved, whilst also containing within it some valuable lessons about how we treat data.
The most important, from my perspective, is that the data sets were held securely by NHS Digital, after full consent was granted by those involved. Not one item of data was taken without express permission, or used for any other purpose than that for which the data was originally sought. In other words, the data was willingly and knowingly given for a specific and transparent purpose. Safeguards were put in place, adhered to, and all parties acted responsibly throughout. Why can’t all data be used in this way?
Ultimately, the NHS Digital story, and others like it, reinforce the importance of the concept of ‘Data Guardianship’. That is, all actors in our data-rich economy need to take responsibility for minimising the damage their actions cause in the present, whilst making every reasonable effort to safeguard the future. The three pillars of Data Guardianship are:
- Organisations shouldn’t gather any data that might expose the subject to excessive privacy risks, now or in the future
- Data should not be hoarded ‘just in case’ – organisations should refuse to keep anything they don’t need
- Organisations should be proactive in explaining what data they’re collecting, how they intend to use it, and what rights the data subject has, in order to enable better decisions around consent
Ultimately, we have to make sure data doesn’t become the new oil, and instead find a metaphor that emphasises the positive values that underpin these pillars, instead of contradicting them. We can’t simply hope that some future phenomenon will make our data safe from abuse – we all need to educate ourselves, and then act accordingly, today. And if we can’t trust companies to behave responsibly, we shouldn’t give them our data in the first place.
Perhaps we should think of our data as a vote that we cast in support of those organisations that are behaving best in the data-based economy. In fact, maybe that’s the new metaphor I’ve been searching for all along: data is the new democracy.
Facing an ever-growing set of new tools and technologies, high functioning analytics teams have come to rely increasingly on data engineers. Building and managing production data engineering pipelines is an inherently complex process, which can prove hard to scale without a systematic approach.
To help navigate this complexity, we have compiled our top advice for successful solutions. Here we examine some of the key guiding principles to help data engineers (of all experience levels) effectively build and manage data pipelines. These have been compiled using the experience of the data engineers at Equal Experts. They collectively recommend the adoption of these principles as they will help you lay the foundation to create sustainable and enduring pipelines.
About this series
This is part three in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. Now we consider the “must have” key principles of data pipeline projects. Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part four, we look at the six key practices needed for a data pipeline. In part five we investigate more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project.
The growing need for good data engineering
If I have learned anything from my years working as a data engineer, it is that practically every data pipeline fails at some point. Broken connections, broken dependencies, data arriving too late, or unreachable external systems or APIs. There are many reasons. But, regardless of the cause, we can do a lot to mitigate the impact of a data pipeline’s failure. These ‘must have’ principles are built up over the years to help to ensure that projects are successful. They are based on my knowledge, and the Equal Experts team’s collective experience, gained across many data pipeline engagements.
Data pipelines are products
Pipelines bring data from important business sources. In many cases, they feed reports and analyses that endure for a long time. Unless your business does not expect to alter how it operates, or there are no amendments to low-level processes, the data pipelines will always need to adapt to the changes in the fundamental processes, new IT, or the data itself. As something that should respond to and embrace regular change, pipelines should be treated as products rather than projects.
This means that there should be multi-year funding to monitor and maintain the existing pipelines. Providing headroom to add new ones, and supporting the analysis and retirement of old ones. Pipelines need product managers to understand the pipelines’ current status and operability, and to prioritise the work. (See this Forbes article for a wider description of working in product-mode over project-mode.)
Find ways for making common use of the data
The data collected for a given problem or piece of analysis will nearly always be useful in answering other questions. When creating pipelines, we try to architect them in a way that allows reuse, whilst also remaining lean in our implementation choices.
In many cases there are simple ways of achieving this. For example, there are usually a variety of places where data is stored in the pipeline. Raw ingested data might be useful for unanticipated purposes. And it can often be made available to skilled users by providing them access to the landing zone.
Appropriate identity and access technologies, such as role-based access, can support reuse while permitting strict adherence to data-protection policies and regulations. The fundamental architecture can stay the same, with access being provided by adding or amending access roles and permissions to data buckets, databases or data warehouses.
A pipeline should operate as a well-defined unit of work
Pipelines have a cadence driven by the need for decision-making and limited by the availability of source data. The developers and users of a pipeline should understand and recognise this as a well-defined unit of work – whether every few seconds, hourly, daily, monthly or event-driven.
Pipelines should be built around use cases
In general, we recommend building pipelines around the use case rather than the data source. This will help ensure that business value is achieved early. In some cases, the same data source might be important to several use cases, each with different cadences and access rights. Understanding when to reuse parts of pipelines and when to create new ones is an important consideration. For example, faster pipelines can always be used for slower cadences, but it typically requires more effort to maintain and adapt them. It might be simpler to create a simpler batch pipeline to meet a new low-latency use case that is not expected to change substantially than to focus on upgrading a fast-streaming pipe to meet the new requirements.
Continuously deliver your pipelines
We want to be able to amend our data pipelines in an agile fashion as the data environment and needs of the business change. So, just like any other piece of working software, continuous delivery practices should be adopted to enable continuous updates of data pipelines in production. Adopting this mindset and these practices is essential to support continuous improvement and create feedback loops that rapidly expose problems and address user feedback.
Consider how you name and partition your data
Data pipelines are a mix of code and infrastructure that can become confusing as they grow if care is not taken with the naming. Pipelines will include at least a set of databases, tables, attributes, buckets, roles, etc., and they should be named in a consistent way to facilitate understanding and maintenance of the pipelines, as well as make the data meaningful to the end-users.
In many architectures, naming will directly affect how your data is partitioned, which in turn affects the speed of the search and retrieval of data. Consider what will be the most frequent queries when specifying bucket names, table partitions, shards, and so on.
Want to know more?
These guiding principles have been born out of our engineers and use each of their 10+ years of data engineering for end-to-end machine learning solutions. We are sure there are lots of other principles, so please do let us know of any approaches you have found effective in managing data pipelines.
In our next blog post in this series we will start laying out some of the key practices of data pipelines. Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.
If you’d like us to share our experience of data pipelines with you, get in touch using the form below.