DSML: Lessons learned from building a Data Science and Machine Learning platform
A data science and machine learning (DSML) platform is a set of tools and services that enable data scientists and machine learning engineers to develop models rapidly, and streamline data analytics. It should also help users to promote solutions from ideation to being operational more easily, helping organisations to bring new services to market faster.
DSML doesn’t only reduce time to market. It can also help development teams through improved reusability of features, governance and compliance support, cost optimisation and improved cross-team collaboration and knowledge sharing.
That said, the complex nature of work and the constantly evolving technology landscape can make building a DSML platform challenging. In this article, we will discuss some of the lessons learned while building a data science and machine learning platform for a leading mining company.
1. Identify and understand the user
Your strategy and product vision must be built upon a good understanding of what value the platform will bring to the business, who will be using it and what level of expertise they will have. Are they seasoned data scientists, or citizen analysts? What problems will this platform solve, and how does that tie in with the organisation’s future plans?
When we understand the user and their needs, this helps us to make better technology choices. For example, no code / low code tools might be an excellent choice for democratising data science, but an engineering-oriented toolset might be more suitable for a small group of highly experienced engineers and analysts.
2. Think big, start small
It’s easy to try and please everyone or over-engineer a solution when there are a vast amount of tools available, and broad expectations from the product team. The problem with this is that the DSML platform could turn into a half-baked solution – and stay that way for a long time.
Instead, take the advice of Barry O’Reilly, author of Unlearn. He advises teams to ‘think big, and start small’. In other words, have a strategy to build a world-class, comprehensive solution but define a roadmap and start with a small use case that delivers a big difference to the organisation. When building a DSML platform, you might want to consider whether this could be batch analytics, online inference, streaming analytics or something else. Again, understanding the customer’s needs is critical.
3. Get the right team
It’s essential to have a cross-functional team covering all aspects of delivery to build a data science and machine learning platform quickly. Depending on the solution chosen, your team could include machine learning engineers, data engineers, cloud engineers, developers and data scientists to support users. It isn’t only about skills, but also privileges. In our case, the single most delayed feature was related to integration with a central Data Lake. Our data engineer did not have sufficient permissions to access it which slowed down debugging and the introduction of changes. Tackling such dependencies through embedding the right people in the team or building relationships early pays off in a faster delivery.
4. The right data science projects
A DSML platform will have constraints related to its architecture and technology used. It’s vital to analyse these constraints, and share them with users. A good example here might be the support for Python and its ecosystem of libraries. A data scientist can easily use any of the available library versions in a local environment but that might not be necessarily the case on the shared platform.
Projects that are in an advanced phase of development or already developed can be particularly tricky to migrate. They can use legacy tools that are not supported on the platform. Moreover, the way the legacy models were developed might make migration very expensive and in some cases, the end model might generate different predictions than the original one.
5. Enable growth
The number of platform users will usually increase over time so it’s important to ensure that the effort for day-to-day support from the platform team is manageable.
Deliver self-service through the automation of procedures and privilege delegation to ensure that product teams can handle BAU tasks without being dependent on platform engineers.
The second big enabler for growing product teams is ensuring they know how to use the platform effectively. Learning at scale requires high-quality training materials for onboarding and building a community where users learn from each other.
The time savings can be used for adding more features but more importantly for engaging in meaningful conversations with users about how the platform should evolve. In our case, the Hassle Maps concept proved to be particularly useful here.
6. Prioritise service availability
Keeping services available is critical to productivity and embedding trust in the platform. Machine Learning can be very resource-demanding. It’s crucial to monitor the cloud infrastructure and react to the changes in demand for CPU, memory, storage but also performance (API response time).
In the event of a production incident, it is invaluable to have a defined incident management process. The Google SRE book provides a great set of practices to ensure the high reliability of the service. In our case, introducing a triaging process with clear responsibilities together with blameless post-mortems resulted in 24 times faster MTTR (Mean Time To Recovery).
7. Share successes
Finally, don’t forget to celebrate and share your successes. Gather stories about how the product teams achieved their goals by utilising the platform. Couple them with metrics to show specific platform outcomes. These could be quantitative (for example active monthly users, time to market, number of API calls) or qualitative (NPS, user surveys). Share these successes with a wider audience through internal communication channels but also on sprint reviews and system demos.
Data governance is a challenge for many organisations, especially when it comes to knowing what data is available and where it is stored.
Typically, we would implement a data catalogue to address this challenge. But getting the most out of these technologies is about more than just installing a tool. In this post, we want to share how applying the tools and techniques of a product manager can help to address data governance problems.
Typical data management challenges
We often work with organisations that have built up large and potentially very useful data estate. This type of organisation commonly experiences data management issues, including:
– High turnaround time to discover information around business and data assets
– Data analysts had low confidence in data due to limited shared context
– Lack of confidence in data sets because of a lack of lineage
– Siloed information leading to effort duplication
These issues cause enormous frustration. Business managers can’t see if projects were successful because it is hard to locate data, while data analysts are unsure if a given data set is trustworthy. Every department has its own definition of what a customer is (and their own data sets).
What does product thinking look like?
As product managers, we would apply a classic product-based approach to shape our response to these challenges. This is: discover -> prototype-> implement approach
Conducting user research helps to identify challenges the organisation has with information management, and can include workshops and individual meetings. These meetings should allow you to identify personas, pain points, frustrations and where improvements can be made. It’s also an opportunity to learn about strategic technical drivers, which helps to identify the most appropriate tools and technology.
By the end of the discovery stage, we had defined the bigger picture and associated vision, goals and nets. We also had a defined set of success measures, which were:
- Time saved by data analysts and other functions (either as a result of improved access or because it was quicker to find the right data asset.)
- Employee Empowerment – does a data catalogue improve employee knowledge of available data and does it make them feel more empowered?
- Increased speed to onboard new employees. Does a data catalogue improve how quickly you can onboard someone onto an analyst team?
- Reduced Infrastructure and related costs – does a catalogue enable people to find existing data sets or reports and does this lead to reduced infrastructure costs?
In a standard product approach, we will prototype the product to assess how people use it and evaluate whether benefits are likely to be met. Some organisations might not be convinced about the need for a prototype, but it’s essential to developing new products and introducing new services, and can make a critical contribution to success.
If it is important to get feedback on the catalogue quickly, it’s not necessary to implement a prototype of the whole service. Whilst automating metadata entry is a big accelerator for data catalogues, for the prototype we use handcrafted templates so that we can get feedback on the user experience quickly and understand what metadata was most relevant.
Once a prototype is in place, having one to one sessions with users helps gather feedback on their use of the tool. We look at how users are being helped to do typical tasks along with a post-test questionnaire that measures acceptance, usability and user satisfaction with the data catalogue. Some of the important points that will come out of this kind of evaluation might include:
- The catalogue displays various metadata for data assets. Using the prototype we can assess which ones are most useful
- Users can easily see where the data came from through a lineage function – this shows what the source is, and how close the asset is to the source. This really helps users assess their confidence in the data-set.
- It also shows a snapshot view – a small sample of the data, users should find this helpful in understanding the content.
During the prototype phase we could also learn:
- What is the most popular tool? Search, or something else?
- Do users find it helpful to request access through a basket function that automates the work of manually finding out who owns the data, making contact with them, waiting for them to have the bandwidth to provide data access etc?
- How useful are crowdsourcing tools? In one recent project users liked that they could provide ratings and reviews of data sources. This helped other users find what they wanted.
- What percentage of data sets are redundant, perhaps because they are duplicate or not useful?
This user feedback means that we are able to iterate on the prototype to improve the design. For example, in some instances, user feedback showed that users were overwhelmed by the standard metadata. In this case, we created customised views based on their feedback. Additionally, new users struggled to understand how to use the tool, so we created an introduction session that walked new users through how to use it.
Finally, the initial catalogue organisation was difficult to navigate so it was refined based on user feedback.
The actual implementation of the production catalogue was undertaken by the catalogue vendor’s implementation partner. In this phase, the supplier continued to give direction and worked with the client to measure the success factors.
We hope that this post has given some insight into some of the considerations when creating a data catalogue solution, and shows how product thinking isn’t just about user interfaces. It can also be applied to data to help organisations shape important data services.