After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.
Based on this, I decided to create a blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering, and it might encourage more software engineers to become one.
This week the interview is with Thorben Louw.
What is data engineering for you and how does it overlap with software engineering?
I’d say that data engineering is a software engineering specialism which takes the best practices from ‘traditional’ software engineering – let’s call that building applications and systems – but focuses these practices on the data domain. I mean stuff like unit testing, automation, CI and CD, telemetry and operability, and even how we deliver things – iterative and incremental approaches and early customer involvement.
There’s definitely a lot of overlap. Software engineers have to deal with data all the time, and data engineers write software all the time, so it’s really a question of degree. Actually, I’m not entirely convinced that it’s helpful to try and make a really definite distinction. For a data engineer, the main difference is that the data is the end-product, and the software that we write is the stuff which moves, shapes and measures data – i.e. data pipelines.
The data that we work with – and the landscape in which it lives and is consumed – really controls the choices we have for implementation technologies and techniques. This can mean a more limited set of tools (platforms, libraries, languages), than might be available to a software engineer writing, say, a microservice or an app.
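As a small illustration of carrying a practice like unit testing over to the data domain, here is a minimal sketch of a pipeline step with a test. The function `clean_records`, the field names `id` and `email`, and the cleaning rules are all hypothetical, chosen only for the example.

```python
def clean_records(records):
    """Normalise emails and drop rows that lack an id or an email.

    Hypothetical cleaning step: the rules here are illustrative only.
    """
    cleaned = []
    for record in records:
        # Treat a missing or None email as empty, then trim and lowercase it.
        email = (record.get("email") or "").strip().lower()
        if record.get("id") is None or not email:
            continue  # skip rows we cannot use downstream
        cleaned.append({"id": record["id"], "email": email})
    return cleaned


def test_clean_records():
    raw = [
        {"id": 1, "email": "  Ada@Example.com "},
        {"id": None, "email": "nobody@example.com"},  # dropped: no id
        {"id": 2, "email": None},                     # dropped: no email
    ]
    assert clean_records(raw) == [{"id": 1, "email": "ada@example.com"}]
```

A test like this can run in CI exactly as an application unit test would. The difference is that the thing being protected is the shape and quality of the data that comes out.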
How did you get involved in data engineering?
My background is in traditional software engineering, but I’ve always had an interest in pattern recognition and machine learning. Over the last few years, I got involved in a few data-heavy projects that involved machine learning, which resulted in my focus shifting to data.
As part of making these machine learning projects work repeatedly at scale, I had to get involved in productionising data pipelines and automating things like data cleaning and preparing training data sets. So I was collaborating with other specialists like data scientists, ML engineers and software engineers to get the right data in the right place, in the right shape, at the right time, and suddenly found myself doing data engineering.
What are the skills a data engineer must have that a software engineer usually doesn’t have?
There are definitely both skill and concept gaps between the people who specialise in making software that works at scale, the people who analyse data and build models, the people who know how to visualise data, and the people who understand how to store, clean, move and prepare data. As a data engineer you’ll probably have to wear all of these hats some of the time, often being the collaborative glue between specialists. So, learning the vocabulary and a little bit of the tooling in each of these domains is useful – but I recognise that that’s also a daunting prospect! As a start, it’s useful to be comfortable with a tool like JupyterLab that data scientists and analysts use all the time, and perhaps also with a popular machine learning framework like TensorFlow.
A modern data engineer is likely to be writing more complex SQL queries than they might have been used to, and diagnosing performance problems with queries, views etc. This often comes as a surprise to people who think modern data engineering is all about writing Scala code for Spark. It’s worth learning SQL well, and also learning about modern data warehouses and column-store databases and how to leverage their power efficiently. You can often succinctly and trivially achieve things that would be a lot of effort to build yourself.
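To give a flavour of how much SQL can do in a few lines, here is a minimal sketch that computes daily totals and a running total with a window function. It uses Python's built-in `sqlite3` module as a stand-in for a real data warehouse, and assumes a SQLite build with window function support (3.25 or later). The `orders` table, its columns and the sample rows are made up for the example.

```python
import sqlite3


def running_revenue(rows):
    """Return (order_day, daily_total, running_total) tuples, ordered by day."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_day TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    query = """
        SELECT order_day,
               daily_total,
               SUM(daily_total) OVER (ORDER BY order_day) AS running_total
        FROM (
            -- First aggregate to one row per day.
            SELECT order_day, SUM(amount) AS daily_total
            FROM orders
            GROUP BY order_day
        )
        ORDER BY order_day
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical sample data.
sample_orders = [
    ("2024-01-01", 10.0),
    ("2024-01-01", 5.0),
    ("2024-01-02", 20.0),
    ("2024-01-03", 7.5),
]
daily = running_revenue(sample_orders)
```

Writing the same running total by hand means sorting, grouping and accumulating yourself. In SQL it is a single `OVER` clause, and the database engine decides how to execute it.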
You might have to learn more about various data modelling strategies for OLAP and OLTP data use-cases.
Learning more about the various kinds of data stores and data processing frameworks available from the various cloud providers is useful, if you haven’t had much reason to yet. Similarly, you might come across some more exotic data formats (like Avro or protobuf), and libraries for working with in-memory data, like Apache Arrow, or DLPack. But this can be different for every data source or project.
Then, there are popular tools and frameworks that classical software engineers might not have had exposure to. Off the top of my head, I can think of orchestration frameworks like Airflow/Prefect/Dagster, various ETL tools, and the trending ELT tools like dbt.
With all of these, I don’t think people should be put off trying out data engineering because they don’t know X! You learn what you need to as you need it.
I think the shift to thinking of data as a product takes a little getting used to. It’s not your code that’s precious – it’s the data it makes available to your end users.
Lastly, getting to know more about the ethics of handling data, and about legal requirements like GDPR, is a really good thing to do!
What data trends are you keeping an eye on?
I’m watching how the MLOps movement matures. A lot of people have seen great benefits from being able to extract insights from their data, but people embarking on machine learning projects often massively underestimate all the other plumbing work it takes to make things successful. And while the modelling tools and frameworks have now almost become commodified, the work needed to produce and deploy good models consistently and reproducibly is mostly still quite bespoke. This plumbing includes stuff like versioning training data, making data available efficiently and affordably (maybe to custom hardware), measuring data quality, optimising model training and selection, and CI/CD practices around machine learning. I’ve seen estimates that this stuff can be 95% of the effort of a machine learning project!
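Two of the plumbing tasks mentioned above, versioning training data and measuring data quality, can be sketched very simply. This is only an illustration under stated assumptions: real projects usually rely on dedicated tooling, and the helper names `dataset_fingerprint` and `null_rate` are hypothetical. Rows are assumed to be JSON-serialisable dictionaries.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest that changes whenever the data changes.

    Keys are sorted so that dictionary key order does not affect the result.
    Row order does affect it, which is an assumption of this sketch.
    """
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def null_rate(rows, column):
    """Return the fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


# Hypothetical training rows.
training_rows = [
    {"text": "good product", "label": 1},
    {"text": "bad product", "label": 0},
    {"text": "no label yet", "label": None},
    {"text": "fine", "label": 1},
]
version = dataset_fingerprint(training_rows)
label_null_rate = null_rate(training_rows, "label")
```

Storing the fingerprint alongside a trained model makes it possible to tell later exactly which data produced it, and a metric like the null rate can be tracked over time to spot quality regressions.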
A particularly interesting thing in this area is the high-end projects that make use of dedicated machine learning hardware (like GPU clusters, Google TPUs, Graphcore IPUs, and systems from the likes of Cerebras, SambaNova and others). Optimising data movement to and from devices is critical and requires a deep understanding of the machine learning models and some understanding of hardware constraints (like memory, networking and disk bandwidth constraints), as well as new tools like compilers that optimise models for these platforms. If people continue to train larger and larger models, this specialist skill will become critical, but luckily tools for it are also improving very rapidly.
In-memory computing seems really exciting and might have a big impact on how we load and process data in future.
Another thing is that the vast majority of data available to us for analysis is still unstructured data, and I think tools and libraries for working efficiently with raw text, images, audio and video have come a long way very quickly in the last decade. It will be amazing to see what the future holds here.
Lastly, I’m quite excited by the emergent data mesh paradigm, which encourages the right kind of decentralisation so that teams structure themselves and their implementations in ways appropriate to their data products. I think it’s our best bet yet for dealing with the rapidly growing data teams and data engineering activities many organisations are starting to struggle with.
Do you have any recommendations for software engineers who want to be data engineers?
Firstly, if data fascinates you, go for it! There’s so much exciting stuff happening in this space, and it all changes pretty fast. So don’t be afraid of just starting – right now! – and learning as you go. That’s pretty much how everyone does it.
I think there’s some vocabulary and perhaps unfamiliar tooling, which can be overwhelming at first and make you feel like some sort of imposter. But, if you have a good heart and a curious mind, you will pick stuff up in no time. There are lots of great resources and awesome blogs and videos.
Also be aware that, in data, there’s plenty of exciting and important stuff happening outside of machine learning and data science – those have just stolen the spotlight for now. Don’t ignore an opportunity because it doesn’t seem like you’ll be doing hip machine learning related stuff.