DSML: Lessons learned from building a Data Science and Machine Learning platform
A data science and machine learning (DSML) platform is a set of tools and services that enable data scientists and machine learning engineers to develop models rapidly and streamline data analytics. It should also make it easier for users to take solutions from ideation into operation, helping organisations to bring new services to market faster.
A DSML platform doesn’t only reduce time to market. It can also help development teams through improved reusability of features, governance and compliance support, cost optimisation, and better cross-team collaboration and knowledge sharing.
That said, the complex nature of the work and the constantly evolving technology landscape can make building a DSML platform challenging. In this article, we will discuss some of the lessons learned while building a data science and machine learning platform for a leading mining company.
1. Identify and understand the user
Your strategy and product vision must be built upon a good understanding of what value the platform will bring to the business, who will be using it and what level of expertise they will have. Are they seasoned data scientists, or citizen analysts? What problems will this platform solve, and how does that tie in with the organisation’s future plans?
Understanding the user and their needs helps us to make better technology choices. For example, no-code / low-code tools might be an excellent choice for democratising data science, but an engineering-oriented toolset might be more suitable for a small group of highly experienced engineers and analysts.
2. Think big, start small
It’s easy to try to please everyone or to over-engineer a solution when there is a vast number of tools available and broad expectations from the product team. The problem with this is that the DSML platform could turn into a half-baked solution – and stay that way for a long time.
Instead, take the advice of Barry O’Reilly, author of Unlearn. He advises teams to ‘think big, and start small’. In other words, have a strategy to build a world-class, comprehensive solution but define a roadmap and start with a small use case that delivers a big difference to the organisation. When building a DSML platform, you might want to consider whether this could be batch analytics, online inference, streaming analytics or something else. Again, understanding the customer’s needs is critical.
3. Get the right team
It’s essential to have a cross-functional team covering all aspects of delivery to build a data science and machine learning platform quickly. Depending on the solution chosen, your team could include machine learning engineers, data engineers, cloud engineers, developers and data scientists to support users. It isn’t only about skills, but also about access privileges. In our case, the single most delayed feature was related to integration with a central Data Lake. Our data engineer did not have sufficient permissions to access it, which slowed down debugging and the introduction of changes. Tackling such dependencies by embedding the right people in the team or building relationships early pays off in faster delivery.
4. The right data science projects
A DSML platform will have constraints related to its architecture and the technology used. It’s vital to analyse these constraints and share them with users. A good example here might be the support for Python and its ecosystem of libraries. A data scientist can easily use any of the available library versions in a local environment, but that might not necessarily be the case on the shared platform.
Projects that are in an advanced phase of development or already developed can be particularly tricky to migrate. They can use legacy tools that are not supported on the platform. Moreover, the way the legacy models were developed might make migration very expensive and in some cases, the end model might generate different predictions than the original one.
5. Enable growth
The number of platform users will usually increase over time so it’s important to ensure that the effort for day-to-day support from the platform team is manageable.
Deliver self-service through the automation of procedures and privilege delegation, so that product teams can handle business-as-usual (BAU) tasks without depending on platform engineers.
The second big enabler for growing product teams is ensuring they know how to use the platform effectively. Learning at scale requires high-quality training materials for onboarding and building a community where users learn from each other.
The time savings can be used for adding more features but more importantly for engaging in meaningful conversations with users about how the platform should evolve. In our case, the Hassle Maps concept proved to be particularly useful here.
6. Prioritise service availability
Keeping services available is critical to productivity and to building trust in the platform. Machine learning can be very resource-demanding, so it’s crucial to monitor the cloud infrastructure and react to changes in demand for CPU, memory and storage, as well as to changes in performance such as API response time.
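As a minimal sketch of what reacting to these signals could look like, the snippet below compares one sample of metrics against fixed limits. The metric names and the limits are assumptions made for illustration; they are not the values or the tooling used on the platform described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # name of the metric being watched
    limit: float  # value above which we want an alert


# Hypothetical limits, chosen only for illustration.
THRESHOLDS = [
    Threshold("cpu_percent", 80.0),
    Threshold("memory_percent", 85.0),
    Threshold("storage_percent", 90.0),
    Threshold("api_p95_ms", 500.0),
]


def breached(sample: dict, thresholds=THRESHOLDS) -> list:
    """Return the names of metrics whose sampled value exceeds its limit."""
    return [t.metric for t in thresholds if sample.get(t.metric, 0.0) > t.limit]


# One invented sample of infrastructure and API metrics.
sample = {
    "cpu_percent": 91.5,
    "memory_percent": 60.0,
    "storage_percent": 72.0,
    "api_p95_ms": 640.0,
}
alerts = breached(sample)
print(alerts)
```

In practice a check like this would normally live in the cloud provider’s monitoring service rather than in hand-written code; the point is only that each resource named above gets an explicit limit and an explicit reaction.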
In the event of a production incident, it is invaluable to have a defined incident management process. The Google SRE book provides a great set of practices to ensure the high reliability of the service. In our case, introducing a triaging process with clear responsibilities, together with blameless post-mortems, resulted in a 24-fold reduction in MTTR (Mean Time To Recovery).
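For readers who have not tracked MTTR before, here is a small sketch of how it can be computed from incident records. The timestamps are invented, and were picked only so that the ratio matches the 24-fold figure quoted above; they are not our incident data.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average time between detection and recovery across incidents."""
    durations = [recovered - detected for detected, recovered in incidents]
    return sum(durations, timedelta()) / len(durations)


def at(day, hour, minute=0):
    """Helper for building invented timestamps in January 2024."""
    return datetime(2024, 1, day, hour, minute)


# (detected, recovered) pairs. All values are made up for illustration.
before = [(at(1, 8), at(2, 4)), (at(10, 9), at(11, 13))]            # 20 h and 28 h
after = [(at(20, 9), at(20, 9, 30)), (at(25, 14), at(25, 15, 30))]  # 30 min and 90 min

mttr_before = mean_time_to_recovery(before)
mttr_after = mean_time_to_recovery(after)
improvement = mttr_before / mttr_after

print(mttr_before, mttr_after, improvement)
```

Recording detection and recovery times for every incident is what makes a comparison like this possible, which is one more argument for a defined incident process.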
7. Share successes
Finally, don’t forget to celebrate and share your successes. Gather stories about how the product teams achieved their goals by utilising the platform. Couple them with metrics to show specific platform outcomes. These could be quantitative (for example monthly active users, time to market, number of API calls) or qualitative (NPS, user surveys). Share these successes with a wider audience through internal communication channels, but also at sprint reviews and system demos.
When our client receives project requests from their customers, a lot of time and cost is spent on resourcing project management – determining which teams should conduct the work – before we even start delivering the work. Here’s how we used data science and dashboarding to speed up this processing time and provide up-to-date metrics on the project delivery process.
The problem of resourcing project management
Our client runs an ever-growing department of over 800 people, delivering numerous projects in parallel, and the number of projects grows year after year. However, the client’s method of distributing work to the relevant teams (what they call their impacting process) hasn’t scaled with the success the client is having.
Impacting is a resource-intensive process requiring each team to read multiple documents – sometimes up to 25,000 words – to identify whether they are required for the project, and often they’re not. This results in a slow, manual process that requires multiple redundant points of contact.
After a project has been through the impacting process and is being delivered, there is no automated reporting. Typically, reporting is triggered by a status request from a senior leader, at which point the data is manually collected, creating slow and infrequent feedback loops.
This is an intensive process which puts tremendous strain on an already busy department, especially as they currently have to process over 100 project requests a week.
Our aim is to reduce the number of people involved in a project impact to only the most relevant individuals, and to streamline the amount of reading required to understand the project.
Leveraging data science for improved project resourcing and reporting
As the client had no clear insight into in-progress projects, we determined that the most useful first step was to provide reporting on these projects using data from their Jira ticketing system. This allows senior leaders to access project delivery information quickly and interactively, enabling them to identify issues and bottlenecks before they become problems.
We then focused on reducing the resource overhead in the impacting process. Project impacting is designed to determine which teams are required to work on a project. In this case, it involved a lot of people reading large documents which were potentially irrelevant to their team’s specialism.
So we sought to improve the impacting process in two ways:
- Can we reduce the amount of time needed to understand the project?
- Can we highlight the project to only the relevant teams?
The scope of data science
Reducing time to understanding
With a typical design document being approximately 25,000 words, it takes a person roughly 3-4 hours to read. Reducing the amount of text needed to understand the document would result in significant time savings per person.
We did this in two ways. First, we used an AI model to summarise the text while retaining important information, allowing users to control the degree of summarisation. This summarisation method is also being used to create executive summaries for senior leaders, who constantly switch context between pieces of work and need to understand different projects very quickly.
Second, we extracted keywords from the text so the user can rapidly determine important terms within the document.
These tools have proved very useful in enabling individuals to quickly establish whether they need to read the document in full, and can slim down reading time from a few hours to a few minutes.
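To make the two ideas concrete, here is a deliberately simple sketch: a word-frequency keyword extractor and an extractive summariser whose `ratio` parameter plays the role of the user-controlled degree of summarisation. It is a stand-in technique, not the AI model used on the engagement, and the stopword list, function names and sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "for", "at", "in", "on",
    "is", "be", "will", "must", "within", "every", "with",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def keywords(text, top_n=5):
    """Return the most frequent non-stopword terms."""
    return [word for word, _ in Counter(tokenize(text)).most_common(top_n)]


def summarise(text, ratio=0.3):
    """Keep the highest-scoring sentences; `ratio` is the share of sentences kept."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    keep = max(1, round(len(sentences) * ratio))
    best = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:keep]
    # Emit the kept sentences in their original order.
    return " ".join(sentences[i] for i in sorted(best))


doc = (
    "The payments service needs a new fraud check. "
    "The fraud check calls the payments API for every card transaction. "
    "Lunch will be provided at the kickoff. "
    "The payments API must respond within 200 milliseconds."
)
summary = summarise(doc, ratio=0.5)
top_term = keywords(doc, top_n=1)
print(summary)
print(top_term)
```

Lowering `ratio` gives a shorter summary and raising it keeps more of the original, which mirrors the control described above even though the underlying method here is far cruder.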
Identifying relevant people
Typically 12+ people can end up reading these documents, meaning that each project takes 6+ days of work just to impact – and many of these people are not even relevant to the project. Therefore, reducing the number of people reading these documents to only the most relevant compounds the savings given through document summarisation.
To do this we developed a machine learning classifier to determine which teams were relevant to a project, reducing the people required for impacting. Additionally, we identified similar existing projects and the teams involved in those, to further assist in establishing the right teams for the work.
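The similar-projects idea can be sketched with a bag-of-words cosine similarity: find the past project whose description is closest to the new request and suggest the teams that delivered it. This is a nearest-neighbour illustration rather than the classifier we built, and the project descriptions, team names and function names below are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented history of (project description, teams involved).
PAST_PROJECTS = [
    ("Upgrade card payments gateway and fraud rules", {"payments", "security"}),
    ("Redesign mobile app onboarding screens", {"mobile", "design"}),
    ("Migrate customer database to new cloud region", {"data", "infrastructure"}),
]


def suggest_teams(description, past=PAST_PROJECTS, top_k=1):
    """Suggest the teams that worked on the top_k most similar past projects."""
    query = vectorise(description)
    ranked = sorted(past, key=lambda p: cosine(query, vectorise(p[0])), reverse=True)
    teams = set()
    for _, project_teams in ranked[:top_k]:
        teams |= project_teams
    return teams


suggested = suggest_teams("Add new fraud rules to the payments gateway")
print(suggested)
```

A trained classifier can weigh terms far better than raw word counts, but the shape of the problem is the same: text in, a short list of relevant teams out.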
As a future enhancement, we would like to build a recommender system that automatically alerts people when new projects arrive that are similar to projects they have previously delivered, further reducing the operational overhead.
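As a rough check on the effort figures quoted in this section, the arithmetic behind the 6+ days can be written out. The 8-hour working day is our assumption (it is what makes 12 readers at 4 hours each come to 6 days), and the "after" scenario uses made-up numbers purely to show how the two savings multiply.

```python
HOURS_PER_WORKING_DAY = 8  # assumption: the working day length is not stated in the text


def impacting_effort_days(readers, hours_per_reader, hours_per_day=HOURS_PER_WORKING_DAY):
    """Total reading effort for one project, in person-days."""
    return readers * hours_per_reader / hours_per_day


# Figures from the text: 12 readers, upper estimate of 4 hours each.
baseline_days = impacting_effort_days(readers=12, hours_per_reader=4)

# Hypothetical scenario: 4 relevant readers spending 15 minutes each on a summary.
reduced_days = impacting_effort_days(readers=4, hours_per_reader=0.25)

print(baseline_days, reduced_days)
```

Because effort is the product of the number of readers and the time each spends, cutting both at once has a much larger effect than cutting either alone.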
The business value of improving project resourcing and reporting through data science
The client is now able to direct incoming projects to the relevant teams much faster, reducing the delay between a project’s request and work starting, and improving new customer satisfaction. The people involved in impacting now have time freed up to lead the deliveries of in-progress projects, which also benefits existing customers and team efficiency.
In the mid-2010s there was a step change in the rate at which businesses started to focus on gaining valuable insights from data.
As the years have passed, the importance of data management has started to sink in throughout the industry. Organisations have realised that you can build the best models, but if your data isn’t of good quality, your results will be wrong.
There are many varied job roles within the data space, and I always thought the distinctions between the roles were pretty obvious. However, a lot has been written recently about the differences between data roles, and more specifically the difference between Data Scientists and Data Engineers.
I think it’s important to understand these differences, because not knowing them can be instrumental in teams failing or underperforming with data. That is why I am writing this article: to clarify the roles, what they mean, and how they fit together. I hope that this will help you to understand the differences between a Data Scientist and a Data Engineer within your organisation.
What do the Data Engineer and Data Scientist roles involve?
So let’s start with the basics. Data Engineers make data available to the business, and Data Scientists enable decisions to be made with the data.
Data Engineers, at a senior level, design and implement services that enable the business to gain access to its data. They do this by building systems that automagically ingest, transform and publish data, whilst gathering relevant metadata (lineage, quality, category, etc.), enabling the right data to be utilised.
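A toy version of such a system might look like the sketch below, where each step records lineage and the transform step records a simple quality measure. The `Dataset` class, the function names, the completeness metric and the sample rows are assumptions made for illustration, not a description of any particular platform.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    rows: list
    lineage: list = field(default_factory=list)   # what happened to the data, in order
    quality: dict = field(default_factory=dict)   # simple quality measures
    category: str = "uncategorised"


def ingest(name, raw_rows, source):
    """Load raw rows and record where they came from."""
    return Dataset(name=name, rows=list(raw_rows), lineage=[f"ingested from {source}"])


def transform(ds):
    """Drop rows with a missing amount and record completeness as a quality metric."""
    complete = [r for r in ds.rows if r.get("amount") is not None]
    completeness = len(complete) / len(ds.rows) if ds.rows else 1.0
    return Dataset(
        name=ds.name,
        rows=complete,
        lineage=ds.lineage + ["dropped rows with missing amount"],
        quality={"completeness": completeness},
        category=ds.category,
    )


def publish(ds, catalogue, category):
    """Tag the dataset and make it discoverable in a catalogue."""
    ds.category = category
    ds.lineage.append("published to catalogue")
    catalogue[ds.name] = ds
    return ds


catalogue = {}
raw = [{"amount": 10}, {"amount": None}, {"amount": 5}, {"amount": 7}]
sales = publish(transform(ingest("sales", raw, source="orders_db")), catalogue, category="finance")
print(sales.lineage, sales.quality)
```

The useful property is that the metadata travels with the data, so whoever picks up `sales` from the catalogue can see where it came from and how complete it was.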
Data Scientists not only utilise the data made available, but also uncover additional data that can be combined and processed to solve business problems.
Both Data Scientists and Data Engineers apply similar approaches to their work. They identify a problem, they look for the best solution, then they implement the solution. The key difference is in the problems they look at and, depending on their experience, the approach taken to solving them.
Data Engineers, like Software Engineers or engineers more generally, tend to use a process of initial development, refinement and automation.
Initial development, refinement and automation explained, with cars.
In 1908 Henry Ford released the Model T Ford. It has many of the same features as a modern car – a wheel on each corner, a bonnet, a roof, seats, a steering wheel, brakes and gears.
In 1959 the first Mini was released. It had all the same features as the Model T Ford. However, it was more comfortable, cheaper, easier to drive, easier to maintain, and more powerful. It also incorporated new features like windscreen wipers, a radio, indicators and rear view mirrors. Basically, the car had, over 50 years, been incrementally improved.
Step forward to the 2010s, when Tesla released the Model S and Model X. These too have many features we can see in the Model T Ford and the Mini. But now they also contain some monumental changes.
The internal combustion engine is replaced with electric drive. They have sat-nav, autopilot, and even infotainment. All of which combine to make the car much easier and more pleasurable to drive.
What we are seeing is the evolution of the car from the initial production line – basic but functional – through multiple improvements in technology, safety, economy, driver and passenger comforts. All of which improve the driving experience.
In other words we are seeing initial development, refinement and automation. A process that Data Engineers and Data Scientists know only too well.
For Data Engineers the focus is on data: getting it from source systems to targets, ensuring the data quality is assessed, the lineage captured, the attributes tagged, and the access controlled.
What about Data Scientists? They absolutely follow the same pattern, but they additionally look to develop analytics along the Descriptive, Diagnostic, Predictive, Prescriptive scale.
So why is there confusion between the Data Scientist and Data Engineer roles?
There is of course not a single answer but some of the common reasons include:
- At the start, both Data Scientists and Data Engineers spend a lot of time Data Wrangling. This means trying to get the data into a shape where it can be used to deliver business benefits.
- At first, teams are often small and work very closely together. In very small organisations the two roles may even be filled by the same person, so it’s easy to see where the confusion might come from.
- It’s often given to Data Engineers to “productionise” analytics models created by Data Scientists.
- Many Data Engineers and Data Scientists dabble in each other’s areas, as there are many skills both roles need to employ. These can include data wrangling, automation and algorithms.
As the seniority of data roles develops, so do the differences.
When I talk to and work with Data Engineers and Data Scientists, I can often categorise them into one of three categories – Junior, Seasoned, Principal – and when I work with Principals, in either space, you can tell they are a world apart in their respective fields.
So what differentiates the different levels and roles?
That’s it. I hope this article helps you to more easily understand the differences between a Data Scientist and a Data Engineer. I also hope this helps you to more easily identify both within your organisation.
What do Data Science and User Experience have in common?
On the surface, you might expect very little as they appear to oppose one another. How about when attempting to understand human behaviour? Both UX and Data Science specialists try and solve these problems, but with different approaches. On a recent engagement, we found that combining techniques from both disciplines yielded powerful results.
The Problem
Our client wanted to understand their users’ needs while using a job-posting website. User personas are a popular tool for communicating user needs off the back of conducting user research. On this engagement, we wanted to see if we could use some data science techniques to provide quantitative validation of the initial qualitative user research.
The Tension Model
We worked in partnership with Koos Service Design. One of the techniques Koos use to develop personas is to investigate conflicting user needs, called “Tensions”. For example, a tension when applying for a job could be the conflict between ‘finding the perfect job’ and ‘finding a job quickly’. Initial research to capture user needs was conducted through in-depth interviews, surveys and exploratory data analysis of user logs.
Initial Personas
From this small pool of data, we identified an initial set of tensions and placed personas along them, with each persona representing a group of users with different needs.
This approach was based on low volumes of qualitative user research data. To enhance and refine the personas we would need to conduct further testing and experimentation with a much larger dataset.
Machine Learning
With the information gathered during the initial user research, we developed a small survey asking True/False questions aimed at testing our hypotheses about the combination of needs people experienced.
This created an extremely large dataset on which we were able to use machine learning to group users together based on similarity.
The technique used was k-means clustering, an unsupervised learning method. The aim of this is to group (or cluster) data that behaves similarly. An optimal number of 5 clusters was identified using the elbow method, which minimises the error in the model without creating too many clusters. The number of personas was then revised to reflect this new information.
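For readers unfamiliar with the approach, here is a small, dependency-free sketch of k-means and the inertia values the elbow method inspects. It is not the code used on the engagement: the survey answers are invented (four True/False questions, encoded as 1/0), and a real analysis would more likely use a library implementation with several random restarts.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two answer vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen respondents
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Inertia: total squared distance from each point to its nearest centroid.
    inertia = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Invented survey: each tuple is one respondent's answers to four True/False questions.
SURVEY = [(1, 1, 0, 0)] * 4 + [(0, 0, 1, 1)] * 4 + [(1, 0, 1, 0)] * 4

inertia_by_k = {k: k_means(SURVEY, k)[1] for k in range(1, 5)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 3))
```

The elbow is the value of k after which adding another cluster stops reducing inertia by much; plotting `inertia_by_k` and looking for that bend is the usual way to choose it.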
Conclusions
There was a lot of similarity between the initial personas and the final data-driven personas. The key divergence was the removal of one persona. However, there were sets of behaviours which persisted between the initial and data-driven personas. For example, the Survivor and the Quick Win both have a desire to get a well-paid job quickly, without any other preferences.
With these personas, the client was able to tailor individual user experiences based on their needs, ultimately improving customer satisfaction and engagement with the system.
This highlights how Data Science can bolster insights from UX design, leading to an end product that is more useful than one built using either technique in isolation.