John Lewis & Partners
Scaling You Build It You Run It at John Lewis & Partners
concurrent product teams
concurrent digital services
deployments a year, every 2-3 days
minutes on average to acknowledge an incident
minutes on average to resolve an incident
in-incident revenue protection effectiveness
About John Lewis & Partners
John Lewis & Partners is one of the UK's oldest, largest and most popular retailers. It operates 34 stores across the UK, as well as johnlewis.com. It has total trading sales of £4.93 billion and a workforce of 38,000 Partners (employees), and is part of the John Lewis Partnership, the largest employee-owned business in the UK.
Rethinking technology operations
In early 2018, John Lewis & Partners committed to reducing its annual multi-million pound opportunity costs by replacing the monolithic COTS ecommerce platform behind johnlewis.com with tens of teams building user-centric digital services. The goals were to:
- scale delivery to many product teams in parallel, to meet business demand.
- accelerate deployments from weekly to daily, to improve customer experience and maximise revenue.
- improve johnlewis.com reliability from 99.0% to 99.9%, and time to resolution from 2 hours to less than 45 mins, to minimise revenue loss and costs during major incidents.
- encourage teams to increase quality and share a body of accumulated knowledge, to reduce unplanned BAU work.
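To put the reliability goal in concrete terms, the short sketch below converts an availability percentage into a yearly downtime budget. It assumes a 365-day year and a site that is expected to be up around the clock; the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, site expected up 24/7


def downtime_budget_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9):
        hours = downtime_budget_hours(target)
        print(f"{target}% availability allows {hours:.2f} hours of downtime a year")
```

Under these assumptions, moving from 99.0% to 99.9% shrinks the yearly downtime budget from about 87.6 hours to about 8.8 hours.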
As part of this work, Equal Experts was asked to help create a step change in technology operations. At the time, John Lewis & Partners had separate Delivery and Operations functions, with the application support team provided as a third party managed service. Deployments could not be accelerated, johnlewis.com reliability could not be improved any further, and the vendor feared being overwhelmed by the number of planned digital services. In addition, there was a support backlog with hundreds of tasks, with an estimated revenue impact in the tens of millions. The challenge set by John Lewis & Partners was how to re-design technology operations, and embed operability into teams at scale.
Equal Experts recommended moving to a You Build It You Run It operating model, in which autonomous product teams would run their own digital services and focus on business outcomes. The plan was to maximise incentives for engineers to build operability into digital services, and free up the application support team to concentrate on critical COTS applications.
Paving a road for incident response
At the time, John Lewis & Partners and Equal Experts were building the John Lewis & Partners Digital Platform (JLDP), to handle the ever-growing number of product teams and digital services. A digital platform is a collection of paved roads, providing fault-free and frictionless user journeys for product teams.
It was important to add an incident response paved road to JLDP. To be comfortable with the You Build It You Run It principle of on-call product teams, engineers needed incident response to be streamlined. The existing workflow involved multiple handoffs:
- An analyst created a ServiceNow ticket, started a private chat room, and phoned an application support analyst.
- The application support analyst worked to resolve the incident, and phoned an incident manager for major incidents.
- The incident manager coordinated urgent response efforts between multiple teams, and handled stakeholder communications.
A fully automated workflow was implemented for digital services in JLDP, which used the PagerDuty incident response platform to connect microservices with Slack and ServiceNow. JLDP automatically provisions each new digital service with an on-call policy and team rota in PagerDuty. When an alert fires:
- PagerDuty automatically creates a ServiceNow ticket, and calls the on-call product team engineer immediately.
- The product team engineer creates a public chat room, works to resolve the incident, and contacts an incident manager for major incidents.
- The incident manager coordinates urgent response efforts between multiple teams, and handles stakeholder communications.
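The three steps above can be sketched as a small model. This is an illustration only: it does not call the PagerDuty, ServiceNow, or Slack APIs, and every name in it (`Incident`, `respond`, the rota contents) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str
    major: bool = False
    log: list[str] = field(default_factory=list)  # ordered record of actions taken


def respond(incident: Incident, rota: dict[str, str], incident_manager: str) -> Incident:
    """Walk one alert through the automated workflow described in the text."""
    engineer = rota[incident.service]  # on-call product team engineer for the service

    # Step 1: automation raises the ticket and calls the on-call engineer.
    incident.log.append(f"ticket created for {incident.service}")
    incident.log.append(f"called {engineer}")

    # Step 2: the engineer opens a public chat room and works the incident.
    incident.log.append(f"{engineer} opened public chat room")

    # Step 3: only major incidents bring in an incident manager.
    if incident.major:
        incident.log.append(f"called incident manager {incident_manager}")
    return incident


if __name__ == "__main__":
    rota = {"basket": "engineer-a"}  # hypothetical rota
    for line in respond(Incident("basket", major=True), rota, "manager-a").log:
        print(line)
```

The point of the model is that the first step needs no human handoff, and the incident manager is involved only when an incident is major.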
Incorporating as-is incident management was vital. John Lewis & Partners had incident managers who were skilled facilitators and communicators, and they lightened cognitive load for on-call engineers during incident response. Incident managers were modelled as a PagerDuty on-call team rota, and could be phoned during an incident with a single button click in PagerDuty.
The incident response paved road was a key step in the operability journey for John Lewis & Partners. Replacing spreadsheets and phone calls with PagerDuty alerts reduced Time To Acknowledge (TTA) from 5-20 minutes to a consistent 60 seconds. Switching from private to public chat rooms allowed anyone to learn from response efforts, during and after incidents. Implementing bi-directional sync between PagerDuty and ServiceNow improved data capture quality in incident tickets, and allowed customer service teams to contact product teams by simply raising a ServiceNow ticket.
Balancing outcomes protection with run costs
The run cost of You Build It You Run It at scale was a concern at the outset for John Lewis & Partners. It was assumed 20, 30, or 40 product teams would require 20, 30, or 40 on-call engineers, and incur a linear run cost.
Early conversations focused on You Build It You Run It as an insurance policy for business outcomes, and the need to optimise run costs without weakening operability incentives for engineers. The answer was to link availability level and on-call level to a financial exposure band – the expected revenue loss and operational costs in an incident. Financial exposure bands and availability levels were sourced from a pre-existing John Lewis & Partners policy.
Product managers were asked to estimate the maximum financial exposure for each digital service. A service availability calculator was then used to assign an availability target and on-call level. This process incentivised product managers to prioritise operational features alongside product features, and brought opportunity costs into the same conversations as run costs. An example calculator is shown below, with some artificial financial exposure bands.
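As a rough illustration of how such a calculator might work, the sketch below maps a maximum financial exposure estimate to an availability target and an on-call level. The band thresholds, availability targets, and labels are invented placeholders, not the figures from the John Lewis & Partners policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    name: str                   # exposure band label
    max_exposure_gbp: float     # upper bound of the band (invented figure)
    availability_target: float  # percent (invented figure)
    on_call_level: str          # out-of-hours cover for the band


# Hypothetical bands, ordered by ascending upper bound.
BANDS = (
    Band("low", 10_000, 99.0, "working hours only"),
    Band("medium", 100_000, 99.5, "shared out-of-hours rota"),
    Band("high", float("inf"), 99.9, "dedicated out-of-hours rota"),
)


def assess(max_financial_exposure_gbp: float) -> Band:
    """Return the first band whose upper bound covers the estimated exposure."""
    if max_financial_exposure_gbp < 0:
        raise ValueError("financial exposure cannot be negative")
    for band in BANDS:
        if max_financial_exposure_gbp <= band.max_exposure_gbp:
            return band
    raise AssertionError("unreachable: the last band has no upper bound")


if __name__ == "__main__":
    band = assess(50_000)
    print(band.name, band.availability_target, band.on_call_level)
```

Keeping the bands in one ordered table means a product manager supplies a single estimate, and the availability target and on-call level follow from it.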
All product teams are on-call during working hours for their own digital services. This incentivises engineers to minimise BAU maintenance work, and implement operational features such as monitoring dashboards, message queues, and circuit breakers. Out of hours, teams have the following guidelines:
- Low exposure. A team has nobody on-call out of hours, and no operations support. This pushes engineers to still invest in operational features, to minimise incident resolution on the next working day.
- Medium exposure. Teams in the same product domain are recommended to rotate one engineer on-call for their combined digital services. Product domains are used as affinity groupings, to encourage a focus on business outcomes and a low cognitive load.
- High exposure. A team has one engineer on-call for their digital services. This ensures the fastest possible time to restore.
The example below shows some digital services at John Lewis & Partners. The basket, electricals, fashion, and upholstery services are part of the same Commercial Journeys product domain, so each night one person is on-call from the three teams for the four services.
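The effect of the out-of-hours guidelines on headcount can be sketched as follows. The service and domain names come from the text, but the team names and the medium exposure band assigned to each service are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    team: str      # hypothetical team name
    domain: str
    exposure: str  # "low", "medium", or "high" (assumed here)


def out_of_hours_on_call(services: list[Service]) -> int:
    """Count the engineers on-call out of hours under the guidelines.

    Low exposure: nobody. Medium: one engineer per product domain.
    High: one engineer per team.
    """
    medium_domains = {s.domain for s in services if s.exposure == "medium"}
    high_teams = {s.team for s in services if s.exposure == "high"}
    return len(medium_domains) + len(high_teams)


# Four services owned by three teams in one product domain.
COMMERCIAL_JOURNEYS = [
    Service("basket", "team-a", "Commercial Journeys", "medium"),
    Service("electricals", "team-b", "Commercial Journeys", "medium"),
    Service("fashion", "team-c", "Commercial Journeys", "medium"),
    Service("upholstery", "team-c", "Commercial Journeys", "medium"),
]

if __name__ == "__main__":
    teams = {s.team for s in COMMERCIAL_JOURNEYS}
    print("one engineer per team:", len(teams))
    print("domain rota:", out_of_hours_on_call(COMMERCIAL_JOURNEYS))
```

With these assumptions, a domain rota needs one engineer each night where a per-team rota would need three, which is how the run cost stops growing linearly with the number of teams.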
Assessing on-call product teams
Initially, You Build It You Run It raised some concerns among product team engineers who had not been on-call before. Understanding their fears, and overcoming them by openly discussing and addressing them, was just as important as changes to processes and tools. The operating model now receives very positive feedback from product teams, who are constantly learning by building and running their own digital services at scale.
The rate of deployments has dramatically increased, and in-incident financial losses have fallen. A good example of this was Black Friday 2020, which saw record user traffic due to the COVID-19 pandemic. The johnlewis.com website performed well, and online sales were 50% higher than 2019.
In 2021, John Lewis & Partners and Equal Experts ran a cost/benefit analysis on two years of data, which validated the value proposition of You Build It You Run It. In deployment throughput, product teams averaged one deploy every 3 days, whereas the application support team averaged weekly deployments with high variability.
In service reliability, on-call product teams had fewer incidents with a financial loss than the third party managed service, and their time to restore was much faster. 44 digital services had a Mean Time To Recovery (MTTR) of 27 mins with low variability. The application support team had an MTTR of 1 hour with high variability.
The table below shows that You Build It You Run It was superior in terms of scale, throughput, and reliability. It includes a revenue protection effectiveness measure, which was devised for a fair comparison between the two operating models despite their differing teams, services, and ranges of financial exposure. Revenue protection effectiveness was defined as the % of expected revenue loss per incident that was realised, based on the time to restore. You Build It You Run It was more effective, year on year.
“With over 30 years in IT, and 10 years leading Operations teams, it’s been hard for me to let go of having direct control of a single Ops team. However, I’ve been amazed that You Build It You Run It has provided a step change compared to traditional ITIL best practice, and the outcomes we’ve demonstrated speak for themselves.”
Simon Skelton, Platform & Operations Manager at John Lewis & Partners