After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed that the majority of the data engineers I spoke to were experienced software engineers.
Based on this, I decided to create a new blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering and encourage more software engineers to become data engineers.
This week the interview is with David Pardoe.
What is data engineering for you and how does it overlap with software engineering?
I have worked in the domain of data science and artificial intelligence for over 25 years. Data has been the driving force for every piece of work I have ever done. Manipulating data is a necessity for any kind of data science work. Most of that work, however, is typically done in an “offline” sense: it is done against a snapshot in time. This means that the data engineering can be done as it is required, and there is no real concept of “productionising” the work. This is where, in my experience, data engineering and software engineering diverge. Data engineering in the data science space is about exploring something, whereas software engineering is carried out in order to build a working piece of software. This means that some of the critical aspects of software engineering best practice, such as continuous integration and continuous delivery, are not considered when carrying out data engineering for data science.
How did you get involved in data engineering?
Even from the early days of my career I was aware of how important it was for the data engineering work I did to be repeatable, either so that it could be reused on similar projects or so that the work itself could be rerun (i.e. “productionising” it). To this end, I always endeavoured to comment the code I wrote fully and to structure it well. In addition, many of the data science projects I have done created datasets that served many other purposes and therefore needed to be delivered in a robust, repeatable way. This has meant that I have drifted towards embracing software engineering best practices.
What are the skills a data engineer must have that a software engineer usually doesn’t have?
It is absolutely key to have an in-depth understanding of data, and not just the code that manipulates it. I have retained a single textbook from my university days: SSADM (Structured Systems Analysis and Design Method). This book details how you represent a business and its core business processes in data terms. It discusses data entities and how they are related to each other through primary key to foreign key relationships. Even with the emergence of newer ways of storing and accessing data, this knowledge is the fundamental building block for how to manipulate data. Of course, at the heart of this is understanding SQL. I was using SQL from day one of my very first job, and that has been the foundation for so much of what I have ended up doing. If a data engineer is given some datasets with little information (especially without a data dictionary!), they should be able to analyse them in order to understand them and answer fundamental questions like: “What does each row actually represent?” and “How do I uniquely identify each row?” All too often I have seen software engineers join datasets and end up duplicating data without knowing why.
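To make the two questions above concrete, here is a minimal Python sketch. The tables, column names and helper functions (`is_unique_key`, `inner_join`) are all hypothetical and invented for illustration; they are not from the interview. It checks whether a set of columns uniquely identifies each row, and shows how a one-to-many join changes what a row represents.

```python
from collections import Counter


def is_unique_key(rows, key_columns):
    """Return True if the given columns uniquely identify every row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return all(n == 1 for n in counts.values())


def inner_join(left, right, key):
    """Naive inner join of two lists of dicts on a single shared column."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical data: one row per customer, and one row per order.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
]

joined = inner_join(customers, orders, "customer_id")

# Before the join, customer_id identifies a row; afterwards it does not,
# because each customer row is repeated once per matching order.
customer_key_ok_before = is_unique_key(customers, ["customer_id"])
customer_key_ok_after = is_unique_key(joined, ["customer_id"])
order_key_ok_after = is_unique_key(joined, ["order_id"])
```

In this sketch the joined result is at order level, not customer level, so any customer attribute summed or counted over it would be overstated. Running a uniqueness check like this before and after a join is one simple way to notice the duplication described above.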
What data trends are you keeping an eye on?
In my view, streaming ETL is perhaps the most significant development in data engineering right now. It is necessary to ensure that data is made available in the most usable structure and format, and is as up to date as possible. Historically, much transformation of data was done on a batch basis, so the transformed data was only available as a snapshot in time that was immediately out of date. This is fine for analytical and reporting purposes, but not for responding to data in real time.
Do you have any recommendations for software engineers who want to be data engineers?
It is essential to understand how data represents business processes. Consider the business process and think about the data that is created (or changed) at each step of the process. This translates to getting a thorough understanding of relational data concepts and SQL (not that you will necessarily be coding in it). This includes such things as knowing how to handle many-to-many relationships between entities, and knowing what inner and outer joins are. It is also important to understand how to transform data from one entity level to another. This involves more than just aggregating; it also covers things like flagging whether a ‘parent’ record has at least one ‘child’ record of a particular type. It will also help if you learn the specifics of how to apply software engineering best practices in the context of data engineering. Develop the skills to be able to test data engineering code. This is primarily done by querying or visualising the data before and after.
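As a rough illustration of the parent and child flagging mentioned above, the sketch below uses Python's built-in `sqlite3` module. The `account` and `txn` tables, their columns and the 'refund' type are assumptions made up for this example. It derives a flag at the parent level and then applies a simple before and after check on the row count.

```python
import sqlite3

# In-memory database with hypothetical parent (account) and child (txn) tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_id INTEGER PRIMARY KEY);
    CREATE TABLE txn (
        txn_id     INTEGER PRIMARY KEY,
        account_id INTEGER REFERENCES account(account_id),
        txn_type   TEXT
    );
    INSERT INTO account VALUES (1), (2), (3);
    INSERT INTO txn VALUES
        (100, 1, 'refund'),
        (101, 1, 'purchase'),
        (102, 1, 'refund'),
        (103, 2, 'purchase');
""")

# Flag each parent that has at least one child of a given type.
# EXISTS keeps the result at one row per account, unlike a plain join,
# which would repeat account 1 once for each of its refunds.
flagged = conn.execute("""
    SELECT a.account_id,
           EXISTS (
               SELECT 1
               FROM txn t
               WHERE t.account_id = a.account_id
                 AND t.txn_type = 'refund'
           ) AS has_refund
    FROM account a
    ORDER BY a.account_id
""").fetchall()

# A simple before/after test: the transformation must not change
# the number of parent rows.
rows_before = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
rows_after = len(flagged)
```

The design choice here is to stay at the parent entity level throughout, so account 3, which has no transactions at all, still appears with a flag of 0. Comparing `rows_before` with `rows_after` is a small example of testing data engineering code by querying the data on both sides of a transformation.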