What’s the difference between a Data Scientist and a Data Engineer?

In the mid 2010’s there was a step change in the rate at which businesses started to focus on gaining valuable insights from data.

As the years have passed, the importance of data management has started to sink in throughout the industry. Organisations have realised that you can build the best models, but if your data isn’t qualitative, your results will be wrong.

There are many, varied job roles within the data space. And I always thought the distinction of the roles were pretty obvious. However, recently a lot has been written about the difference between the different data roles, and more specifically the difference between Data Scientists and Data Engineers. 

I think it’s important to understand that not knowing these differences can be instrumental in teams failing or underperforming with data. Which is why I am writing this article. To attempt to clarify the roles, what they mean, and how they fit together. I hope that this will help you to understand the differences between a Data Scientist and a Data Engineer within your organisation.

What do the Data Engineer and Data Scientist roles involve?

So let’s start with the basics. Data Engineers make data available to the business, and Data Scientists enable decisions to be made with the data. 

Data Engineers, at a senior level, design and implement services that enable the business to gain access to its data. They do this by building systems that automagically ingest, transform and publish data, whilst gathering relevant metadata (lineage, quality, category, etc.), enabling the right data to be utilised.  

Data Scientists not only utilise the data made available, but also uncover additional data that can be combined and processed to solve business problems.  

Both Data Scientists and Data Engineers apply similar approaches to their work.  They identify a problem, they look for the best solution, then they implement the solution. The key difference is the problems they look at and, depending on their experience, the approach taken to solving it.  

Data Engineers like Software Engineers, or even more generally engineers, tend to use a process of initial development, refinement and automation.  

Initial development, refinement and automation explained, with cars.

In 1908 Henry Ford released the Model T Ford. As you can see, it has many of the same features as a modern car – wheels on each corner, a bonnet, a roof, seats, a steering wheel, brakes, gears.  

 

In 1959 the first Mini was released.  It had all the same features as the Model T Ford. However, it was more comfortable, cheaper, easier to drive, easier to maintain, and more powerful. It also incorporated new features like windscreen wipers, a radio, indicators, rear view mirrors. Basically, the car had, over 50 years, been incrementally improved.  

Step forward in time to 2010, and Tesla released the Models S and X. These too have many features we can see in the Model T Ford and the Mini.  But now they also contain some monumental changes.

The internal combustion engine is replaced with electric drive. It has sat-nav, autopilot, and even infotainment. All of which combine to make the car much easier and more pleasurable to drive.

What we are seeing is the evolution of the car from the initial production line – basic but functional – through multiple improvements in technology, safety, economy, driver and passenger comforts. All of which improve the driving experience.  

In other words we are seeing initial development, refinement and automation. A process that Data Engineers and Data Scientists know only too well.

For Data Engineers the focus is on data, getting it from source systems to targets, ensuring the data quality is qualified, the lineage captured, the attributes tagged, and the access controlled. 

What about Data Scientists?  They absolutely follow the same pattern, but they additionally look to develop analytics along the Descriptive, Diagnostic, Predictive, Prescriptive scale.  

So why is there confusion between the Data Scientist and Data Engineer roles?  

There is of course not a single answer but some of the common reasons include:

  • At the start, both Data Scientist and Data Engineers spend a lot of time Data Wrangling. This means trying to get the data into a shape where it can be used to deliver business benefits.
  • At first, the teams are often small and they always work very closely together, in fact, in very small organisations they may be the same person – so it’s easy to see where the confusion might come from.
  • It’s often given to Data Engineers to “productionise” analytics model created by Data Scientists.
  • Many Data Engineers and Data Scientists dabble in each other’s areas, as there are many skills both roles need to employ. These can include data wrangling, automation and algorithms..  

As the seniority of data roles develop, so do the differences.

When I talk to and work with Data Engineers and Data Scientists, I can often categorise them into one of three categories – Junior, Seasoned, Principal – and when I work with Principals, in either space, you can tell they are a world apart in their respective fields.  

So what differentiates the different levels and roles?

That’s it. I hope this article helps you to more easily understand the differences between a Data Scientist and a Data Engineer. I also hope this helps you to more easily identify both within your organisation.  If you’d like to learn more about our Data Practice at Equal Experts, please get in touch using the form below.

 

What do Data Science and User Experience have in common? 

On the surface, you might expect very little as they appear to oppose one another. How about when attempting to understand human behaviour? Both UX and Data Science specialists try and solve these problems, but with different approaches. On a recent engagement, we found that combining techniques from both disciplines yielded powerful results. 

The Problem

Our client wanted to understand their users’ needs while using a job-posting website. User personas are a popular tool for communicating user needs off the back of conducting user research. On this engagement, we wanted to see if we could use some data science techniques to provide quantitative validation of the initial qualitative user research

The Tension Model

We worked in partnership with Koos Service Design. One of the techniques Koos use to develop personas is to investigate conflicting user needs, called “Tensions”. For example, a tension when applying for a job could be the conflict between ‘finding the perfect job’ and ‘finding a job quickly’. Initial research to capture user needs was conducted through in-depth interviews, surveys and exploratory data analysis of user logs.

Initial Personas

From this small pool of data, an initial set of tensions was identified onto which personas (detailed below) are placed that encompass the different needs groups of users. 

This approach was based on low-volumes of qualitative user research data. To enhance and refine the personas we would need to conduct further testing and experimentation with a much larger dataset.

Machine Learning

With the information gathered during the initial user research, we developed a small survey asking True/False questions aimed at testing our hypotheses about the combination of needs people experienced.

This created an extremely large dataset on which we were able to use machine learning to group users together based on similarity.

The technique utilized was unsupervised k-means clustering. The aim of this is to group (or cluster) data that behaves similarly. An optimal number of 5 clusters was identified using the elbow method to minimise the error in the model without creating too many clusters. So the number of personas was revised to reflect this new information.

Conclusions

There was a lot of similarity between the initial personas and the final data-driven personas. The key divergence was the removal of one persona. However, there were sets of behaviours which persist between the initial and data-driven personas. For example, as the Survivor and the Quick Win, both have a desire to get a well-paid job quickly without any other preferences.

With these personas, the client was able to tailor individual user experiences based on their needs, ultimately improving customer satisfaction and engagement with the system.

This highlights how Data Science can bolster insights from UX design, leading to an end product more useful than using either technique in isolation.