I’ve been interviewing our data engineers to understand how data engineering is seen these days, how people become data engineers, and how data engineering overlaps with software engineering.
I want to start by thanking all the data engineers for giving their time to do this with me. During the series, I was able to mature my own ideas about data engineering by learning from others’ experiences. I believe this series gives us a more shared and transparent view of data engineering. Also, I really hope I have encouraged more software engineers to join the data field.
For this final post, I’m sharing my own takeaways from the series.
Data engineering is a specialism
One aspect I want to recall is that there are two kinds of data – application data and analytical data. Application data is the data that allows a business to run, and analytical data is the data that optimizes a business – when we are speaking about data engineering we are mostly thinking about analytical data.
Two major factors make data engineering a separate and specialised silo in organisations. First, analytical data is often only taken into consideration far down the road. Second, the learning curve of some of the available data tools adds to the specialism. For instance, when companies started to use big data processing tools like Hadoop, they had to learn a whole new and complex ecosystem.
However, I do believe that organisations are starting to see data as a crucial part of the business and as a first-class concern, and for this to be effective, data work should be considered at the application level. Data mesh, which was referenced by multiple interviewees, is an emerging architecture that is starting to change the way organisations think about and work with data. Making analytical data a first-class concern means that the engineers who work at the application level will also need to be aware of analytical data. The adoption of cross-functional teams, the expansion of data literacy to non-technical roles like product owners, and the evolution of data tools are fundamental for this to happen.
Alongside this evolution, we might start to see data engineering generalise into the other engineering roles, provided the data tooling also evolves and the learning curve flattens. For now, though, the industry still considers data engineering a specialism.
The journey from software engineer to data engineer
Apart from cases where people got exposure to data during their academic careers, the majority of the engineers that I interviewed shared that they were software engineers and they started to work on a data project by preference, by chance, or by accident.
I would say the journey isn’t effortless: the data landscape can be overwhelming and it’s a different context. But as we’ve seen during the series, the engineering skills are the same.
We’ve seen SQL mastery being recommended as one of the skills that every data engineer should have, and I fully agree with it, because cloud data warehouses are being used more than ever for data workloads.
One of the interviewees mentioned that concepts might be more important than skills and I absolutely agree. So here is a list of concepts I consider fundamental:
- Understanding storage formats and the applicability and particularities for each type
- Data processing at scale
- Stream processing
- Understanding GDPR, data security, and privacy
- Awareness of the data landscape and capabilities of cloud providers
- Adaptability to a new environment and mindset, not just from a technology perspective but also from a business one
I called these concepts fundamental, but that doesn’t mean an engineer needs all of them before joining a data engagement. Please keep in mind that no one knows everything, and learning on the job happens often and gracefully.
Lack of standards in data
With the massive growth of data, data tooling has been evolving at a fast pace. In 2004 Google introduced MapReduce as a programming model for handling large amounts of data. The Hadoop ecosystem followed, became mainstream, and ended up slightly overused.
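For readers who have not met the MapReduce programming model, here is a minimal single-process sketch of the idea in Python, using the classic word count example. It is an illustration only: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this post, and a real framework such as Hadoop would run the phases in parallel across many machines rather than in one loop.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, the work a framework does between phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: list[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "data tools"]))
```

The appeal of the model is that the map and reduce functions know nothing about distribution, so the framework can split the input and the keys across as many workers as it likes.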
Coexisting with the batch world there’s the streaming world, which addresses problems that need low-latency, near real-time answers, often called real-time analytics. At some point there was an explosion of interest in streaming, and organisations started to adopt and overuse it, sometimes when there was no real gain in having real-time analytics.
What we see is that the technology often evolves in ways that let companies scale and become efficiently data-driven, but at other times unnecessary complexity is introduced without a proper use case or a real need. A good example is what happened when the data lake architecture became mainstream: some data lakes ended up as data swamps, dumping grounds for data without the tools to use it.
We’re starting to see another pattern around the use of cloud data warehouses like BigQuery or Snowflake, which were referenced in multiple interviews. These modern data warehouses are scalable, cost-effective, and easy to get started with, and they are being applied to more use cases than ever. This is reflected in the data landscape by the proliferation of tools that connect to the data warehouses to manage data. You might see this being referred to as the Modern Data Stack, which is a set of tools and strategies to manage data using the cloud data warehouse as the central place to store and process the data.
The problem with the Modern Data Stack is that it’s not a well-defined stack. It’s just a term used to refer to data tools in this space, and there are a lot of them. Still, we’re seeing organisations slowly converge on specific tools. For example, dbt (data build tool) is becoming a common pattern for handling data transformations using SQL on top of the data warehouse.
That being said, the data space is evolving at a fast pace to cope with data growth, which leads organisations to try different strategies, sometimes driven by value and other times driven by technology. With the emergence of cloud data warehouses, we are starting to see patterns, strategies, and tools that will hopefully give us a slightly less complex data world.
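To make the "transformations using SQL on top of the data warehouse" idea concrete, here is a rough sketch in Python. It does not use dbt or its API. It only mimics the core idea, a SELECT statement materialised as a table inside the database, with SQLite standing in for a cloud data warehouse. The table, the columns, and the `materialise` helper are all hypothetical, invented for this example.

```python
import sqlite3

# SQLite in memory stands in for the warehouse; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders "
    "(order_id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "acme", 1250, "completed"),
        (2, "acme", 750, "completed"),
        (3, "globex", 500, "cancelled"),
        (4, "globex", 3000, "completed"),
    ],
)

# The "model": a plain SELECT describing the transformed data set.
CUSTOMER_REVENUE_MODEL = """
SELECT
    customer,
    COUNT(*) AS completed_orders,
    SUM(amount_cents) / 100.0 AS revenue
FROM raw_orders
WHERE status = 'completed'
GROUP BY customer
"""


def materialise(connection: sqlite3.Connection, name: str, select_sql: str) -> None:
    """Rebuild table `name` from a SELECT, so the work happens inside the database."""
    connection.execute(f"DROP TABLE IF EXISTS {name}")
    connection.execute(f"CREATE TABLE {name} AS {select_sql}")


materialise(conn, "customer_revenue", CUSTOMER_REVENUE_MODEL)
rows = conn.execute(
    "SELECT customer, completed_orders, revenue "
    "FROM customer_revenue ORDER BY customer"
).fetchall()
```

The point of the pattern is that the transformation logic lives in SQL and runs where the data already is, which is also why SQL mastery came up so often in the interviews.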