After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.
Based on this, I decided to create a new blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering and might encourage more software engineers to become data engineers.
This week the interview is with Gavin Campbell.
What is data engineering for you and how does it overlap with software engineering?
Data engineering is all the stuff that happens to our data between the point where it is created and the point where it can be used for something valuable. In the past, this mostly involved using graphical ETL tools to extract data from the relational databases backing business applications or websites, transform it into a more or less dimensional schema, and load it into another database, commonly known as a data warehouse. The consumers of this data would then analyse it using graphical reporting tools. Before the discovery of data engineering, this entire field of endeavour was known as business intelligence.
In recent years, the range of sources for potentially valuable data has expanded greatly to include data from website interactions, data from mobile devices, data only available from third-party APIs, and many more. The range of potential uses of this data has also expanded greatly, from pre-formatted reports, to interactive dashboards, to feeding the outputs of data analysis back into the applications that generate the data, in recommender systems for example.
In parallel with this, there has been a creeping realisation that graphical tools don’t lend themselves to easy versioning, testing, packaging and deployment. Fortunately for the data engineers, the closely related field of software engineering solved most of these problems a very long time ago, provided that the code in question can be represented as text files. Accordingly, most modern data engineering involves writing code in a text editor, much like software engineering, and writing automated tests and deployment pipelines to deliver this code.
In short, data engineering is writing code that wrangles data using the practices that we know to deliver better outcomes in software delivery.
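As a minimal sketch of what this can look like in practice, the example below keeps a transformation as a plain function in a text file and puts an automated test next to it. The record shape, the field names (`status`, `created_at`, `amount_pence`) and the function names are all hypothetical, invented for illustration rather than taken from any real project.

```python
from datetime import date


def to_daily_revenue(orders):
    """Sum the value of completed orders per calendar day.

    `orders` is a list of dicts with hypothetical keys:
    "status", "created_at" (ISO 8601 timestamp string) and "amount_pence".
    """
    totals = {}
    for order in orders:
        # Only completed orders count towards revenue.
        if order["status"] != "completed":
            continue
        # The first ten characters of an ISO timestamp are the date part.
        day = date.fromisoformat(order["created_at"][:10])
        totals[day] = totals.get(day, 0) + order["amount_pence"]
    return totals


def test_to_daily_revenue_ignores_cancelled_orders():
    orders = [
        {"status": "completed", "created_at": "2021-03-01T09:15:00", "amount_pence": 1200},
        {"status": "cancelled", "created_at": "2021-03-01T10:00:00", "amount_pence": 5000},
        {"status": "completed", "created_at": "2021-03-01T18:30:00", "amount_pence": 800},
        {"status": "completed", "created_at": "2021-03-02T08:00:00", "amount_pence": 300},
    ]
    assert to_daily_revenue(orders) == {
        date(2021, 3, 1): 2000,
        date(2021, 3, 2): 300,
    }
```

Because both the transformation and its test are ordinary text files, they can be versioned, reviewed and run in a deployment pipeline in the same way as any other code.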
How did you get involved in data engineering?
I have spent most of my career skipping back and forward between software development, “data”, and the all-important “DevOps”. Having started as a fairly incompetent C/C++ programmer a very long time ago, I drifted into database administration at a time when database administration involved star-point screwdrivers and soldering irons. From there it was a natural progression into business intelligence and the realisation that the reason all this stuff was so unreliable was that nobody was writing any tests.
I expended a huge amount of energy trying to come up with satisfactory ways to write automated tests for these processes, ranging from SQL Server stored procedures, to ETL tools, to graphical reporting tools, and now to Python notebooks.
I’m not sure that all of this effort has been 100% valuable, and I now think that there are some tools better thought of as end-user tools for which it isn’t worthwhile attempting to implement software delivery techniques. The irony, of course, is that tools like Tableau, Power BI, and Qlik were all supposed to be end-user tools, yet the job listing websites are full of advertisements for Tableau, Power BI, and Qlik developers.
This led to an increased amount of work helping data teams implement automated testing and deployment, during which I have dipped in and out of working with teams who work on actual websites that do useful things. I feel this has helped me understand what “good” looks like when working with data teams.
What are the skills a data engineer must have that a software engineer usually doesn’t have?
I think the hardest thing for software engineers is that data engineering almost always involves some kind of external platform, such as PostgreSQL, or Apache Spark, or Snowflake. What this means is that the way you write your own code can have dramatic effects on the performance or other behaviour of the platform, sometimes in non-obvious ways. These platforms also change the way we think about testing, since if almost all your code is going to be executed by Apache Spark, writing a complex mock for Apache Spark itself may not be the most valuable activity.
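One way to illustrate the testing point is to run the code under test against a small, real instance of an engine instead of a mock of it. The sketch below is only an assumption about how such a test might look: it uses an in-memory SQLite database from the Python standard library as a stand-in for the platform, and the `users` table, its columns and the function names are hypothetical. SQLite will not reproduce the behaviour or performance of PostgreSQL, Spark or Snowflake, so a real project would more likely point a test like this at a local or disposable instance of the actual platform.

```python
import sqlite3

# The query under test. In a real project this text would be executed by the platform.
DAILY_SIGNUPS_SQL = """
    SELECT signup_date, COUNT(*) AS signups
    FROM users
    GROUP BY signup_date
    ORDER BY signup_date
"""


def daily_signups(conn):
    """Run the query on the given connection and return rows as a list of tuples."""
    return conn.execute(DAILY_SIGNUPS_SQL).fetchall()


def test_daily_signups_counts_users_per_day():
    # A real (if tiny) engine executes the SQL, so nothing is mocked.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, signup_date TEXT)"
        )
        conn.executemany(
            "INSERT INTO users (id, signup_date) VALUES (?, ?)",
            [(1, "2021-03-01"), (2, "2021-03-01"), (3, "2021-03-02")],
        )
        assert daily_signups(conn) == [("2021-03-01", 2), ("2021-03-02", 1)]
    finally:
        conn.close()
```

The design choice here is that the test exercises the query text itself, which is where most of the logic lives, and leaves the engine's own behaviour to the engine.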
There are also certain types of change that are very expensive to make – generally, those involving a lot of data movement – which is a consideration that doesn’t apply to deploying web apps, for example. These changes, when necessary, need to be identified ahead of time.
What data trends are you keeping an eye on?
I think the data tools space currently consists of a lot of vendors attempting to eat each other’s lunches by expanding the core capabilities of their tools – such as data modelling, or data transformation – into areas traditionally served by other tools. This has led to a version of the classic “one-stop-shop” vs “best of breed” decision for many projects.
There are also customers who have decided all of this stuff is a bit too difficult and are gravitating towards no-code or low-code solutions. Naturally, there are other vendors rushing to fill this space with graphical tools that suffer from all the same problems as their predecessors from the 1990s.
Do you have any recommendations for software engineers who want to be data engineers?
Software engineers already have most of the technical skills needed to be data engineers. Often the “a-ha” moment comes from finding a problem that is difficult to solve at scale by churning out Java or Scala or Python code and finding a solution using Spark or similar platforms. Also, in a data engineering team, there will be people with strong statistical backgrounds but very little experience in the tools for software delivery, so individuals with a software engineering background can make a significant difference to the success of these teams.