DSML: Lessons learned from building a Data Science and Machine Learning platform
A data science and machine learning (DSML) platform is a set of tools and services that enable data scientists and machine learning engineers to develop models rapidly, and streamline data analytics. It should also help users to promote solutions from ideation to being operational more easily, helping organisations to bring new services to market faster.
DSML doesn’t only reduce time to market. It can also help development teams through improved reusability of features, governance and compliance support, cost optimisation and improved cross-team collaboration and knowledge sharing.
That said, the complex nature of work and the constantly evolving technology landscape can make building a DSML platform challenging. In this article, we will discuss some of the lessons learned while building a data science and machine learning platform for a leading mining company.
1. Identify and understand the user
Your strategy and product vision must be built upon a good understanding of what value the platform will bring to the business, who will be using it and what level of expertise they will have. Are they seasoned data scientists, or citizen analysts? What problems will this platform solve, and how does that tie in with the organisation’s future plans?
When we understand the user and their needs, this helps us to make better technology choices. For example, no code / low code tools might be an excellent choice for democratising data science, but an engineering-oriented toolset might be more suitable for a small group of highly experienced engineers and analysts.
2. Think big, start small
It’s easy to try and please everyone or over-engineer a solution when there are a vast amount of tools available, and broad expectations from the product team. The problem with this is that the DSML platform could turn into a half-baked solution – and stay that way for a long time.
Instead, take the advice of Barry O’Reilly, author of Unlearn. He advises teams to ‘think big, and start small’. In other words, have a strategy to build a world-class, comprehensive solution but define a roadmap and start with a small use case that delivers a big difference to the organisation. When building a DSML platform, you might want to consider whether this could be batch analytics, online inference, streaming analytics or something else. Again, understanding the customer’s needs is critical.
3. Get the right team
It’s essential to have a cross-functional team covering all aspects of delivery to build a data science and machine learning platform quickly. Depending on the solution chosen, your team could include machine learning engineers, data engineers, cloud engineers, developers and data scientists to support users. It isn’t only about skills, but also privileges. In our case, the single most delayed feature was related to integration with a central Data Lake. Our data engineer did not have sufficient permissions to access it which slowed down debugging and the introduction of changes. Tackling such dependencies through embedding the right people in the team or building relationships early pays off in a faster delivery.
4. The right data science projects
A DSML platform will have constraints related to its architecture and technology used. It’s vital to analyse these constraints, and share them with users. A good example here might be the support for Python and its ecosystem of libraries. A data scientist can easily use any of the available library versions in a local environment but that might not be necessarily the case on the shared platform.
Projects that are in an advanced phase of development or already developed can be particularly tricky to migrate. They can use legacy tools that are not supported on the platform. Moreover, the way the legacy models were developed might make migration very expensive and in some cases, the end model might generate different predictions than the original one.
5. Enable growth
The number of platform users will usually increase over time so it’s important to ensure that the effort for day-to-day support from the platform team is manageable.
Deliver self-service through the automation of procedures and privilege delegation to ensure that product teams can handle BAU tasks without being dependent on platform engineers.
The second big enabler for growing product teams is ensuring they know how to use the platform effectively. Learning at scale requires high-quality training materials for onboarding and building a community where users learn from each other.
The time savings can be used for adding more features but more importantly for engaging in meaningful conversations with users about how the platform should evolve. In our case, the Hassle Maps concept proved to be particularly useful here.
6. Prioritise service availability
Keeping services available is critical to productivity and embedding trust in the platform. Machine Learning can be very resource-demanding. It’s crucial to monitor the cloud infrastructure and react to the changes in demand for CPU, memory, storage but also performance (API response time).
In the event of a production incident, it is invaluable to have a defined incident management process. The Google SRE book provides a great set of practices to ensure the high reliability of the service. In our case, introducing a triaging process with clear responsibilities together with blameless post-mortems resulted in 24 times faster MTTR (Mean Time To Recovery).
7. Share successes
Finally, don’t forget to celebrate and share your successes. Gather stories about how the product teams achieved their goals by utilising the platform. Couple them with metrics to show specific platform outcomes. These could be quantitative (for example active monthly users, time to market, number of API calls) or qualitative (NPS, user surveys). Share these successes with a wider audience through internal communication channels but also on sprint reviews and system demos.