Last week, John Lewis & Partners announced the effective closure of their head office in Victoria, which means that a lot of staff have had to adjust to working from home.
Our experience has been that John Lewis & Partners has taken to the new remote model extremely well. For one team, the change has had a positive impact on their ability to deliver. In the first week of the change, they almost doubled their throughput and performed more releases to customers than in any other given week in the last five months.
This team considers collaboration to be their superpower. They continue feeding and watering their team spirit in the new context. There is no single correct way to do this, but here are some of the experiments the teams are trying.
One experiment is running a perpetual, team-wide Hangouts call to mimic a live office environment where you can hear each other working. Back in the office, you would be able to simply turn around to a colleague and say, “Can I chat with you for two minutes about X, Y or Z?” and that’d be fine. Not only that, but because everyone was co-located, others could overhear the conversation even if they were not directly involved. The team has emulated that by keeping a Hangouts call always on and holding occasional meetings in it. This helps the team feel in touch with each other, and conversations can spring up spontaneously. Even though these meetings don’t necessarily involve everyone, team members still benefit from being able to listen in.
Another experiment is setting time aside each day for some form of meditation or mindfulness exercise. This is not a group activity, but team members do it at the same time each day. Synchronising the break means everyone is away, and back, at the same time, which maximises the opportunities for collaborative working. The effect is noticeable: afterwards, team members are more relaxed and better able to focus.
As part of this transition, we hosted a number of webinars to share good practices for teams working fully remotely. Most of the John Lewis & Partners teams that we work with were already set up to enable home working. However, moving from a few people occasionally working remotely to everyone working remotely all the time is not a trivial transition. Our webinars are designed for teams that are already comfortable with working remotely. We share tips and practices that help them gel and perform in a remote-first environment.
For example, if you want to learn some of the techniques we use to build high-performing remote-first teams, watch this webinar.
Part of our mission working with John Lewis & Partners is to enable their Partners. This means that our consultants transfer the necessary skills and knowledge to the Partners so they can continue to develop new digital services and products for their customers.
At Equal Experts, we have been building a remote-first mindset for years and have engaged in a number of fully remote deliveries. That’s why we published and open-sourced our remote delivery playbook earlier this year.
The complexity of modern, distributed IT systems means they operate on the edge of failure. We help clients manage this risk by exploring the business impact of IT failure modes, and then investing time in improving the resilience of the people, products, and processes within the system. One way we do this is by running Chaos Days, in which we deliberately inject faults into a system and explore their impact.
Equal Experts have been working with John Lewis & Partners since 2018, to help them improve their ability to deliver and operate online retail services. This includes replacing existing eCommerce applications with cloud-based microservices running on the John Lewis & Partners Digital Platform (JLDP). This bespoke platform-as-a-service enables product teams to rapidly build and run services, using Continuous Delivery and Operability practices, including You Build It, You Run It.
We recently partnered with John Lewis & Partners to run their inaugural Chaos Day. We were delighted to share our expertise. The John Lewis & Partners engineers learned how revenue-impacting failures could occur, and how best to mitigate them. They also gained the skills and knowledge they need to run their own Chaos Days independently.
John Lewis & Partners has 25 product teams building 40 services on their platform, which is built and run by two platform teams. This has created around 100 microservices, internal and external components, and downstream dependencies.
At Equal Experts, our approach to Chaos Days is to start small and then iterate. Rather than explore the impact of chaos on all teams and services, our first Chaos Day at John Lewis & Partners had two teams participating: one platform team and one product team. A test environment was selected as the target for all experiments. To forewarn the organisation of any unintentional impact, and whet appetites for future Chaos Days, we preceded Chaos Day 1 with a presentation to John Lewis & Partners IT personnel on our approach.
Experiment selection and design
Two weeks before Chaos Day, we met with the most experienced members of the platform team, who would become the agents of chaos. The people in this role design and execute the experiments, which the rest of their team detect and respond to, as if they were real production incidents. We intentionally asked for the most experienced platform engineers. With them out of the response, we could test how well incident response skills were spread across the rest of the team.
We gathered around a whiteboard, and one of the engineers sketched out their platform architecture. The group then jointly brainstormed possible experiments, considering:
- What failure mode we would learn about, e.g. loss of connectivity, an instance being terminated, network slowdown
- What impact we would expect, e.g. impact on customer transactions
- What response we would expect, e.g. the service auto-heals, an alert is fired, or nobody notices!
- How normal service would be resumed, e.g. if the resolution is unsuccessful, how would the injected fault be rolled back?
- Whether the experiment should be run in isolation, e.g. would a monitoring fault limit learning from other experiments?
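To make those considerations concrete, here is a minimal sketch of how one experiment could be written down before the day. It is an illustration only: the `ChaosExperiment` class, its field names and the two example experiments are our own assumptions, not the format used at John Lewis & Partners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosExperiment:
    """One planned experiment (hypothetical structure, for illustration)."""
    name: str
    failure_mode: str       # what we want to learn about
    expected_impact: str    # e.g. impact on customer transactions
    expected_response: str  # e.g. auto-heal, alert fired, nobody notices
    rollback: str           # how the injected fault is removed
    run_in_isolation: bool  # True if it would distort other experiments


def can_overlap(a: ChaosExperiment, b: ChaosExperiment) -> bool:
    """Two experiments may run together only if neither needs isolation."""
    return not (a.run_in_isolation or b.run_in_isolation)


pod_kill = ChaosExperiment(
    name="terminate-pod",
    failure_mode="an instance being terminated",
    expected_impact="brief drop in capacity for one service",
    expected_response="the service auto-heals",
    rollback="none needed; the pod is rescheduled",
    run_in_isolation=False,
)
monitoring_fault = ChaosExperiment(
    name="break-monitoring",
    failure_mode="monitoring fault",
    expected_impact="alerts stop arriving",
    expected_response="nobody notices",
    rollback="restore the monitoring configuration",
    run_in_isolation=True,
)

print(can_overlap(pod_kill, monitoring_fault))
```

Writing the rollback and the isolation flag down up front forces the conversation about how normal service is restored before any fault is injected.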
The agents of chaos came up with 40 ideas, which we reviewed and whittled down to 10. We did this based on risk and impact to John Lewis & Partners, i.e. which failures were most likely to happen and damage the business. Two weeks were then spent developing and testing the code and configuration changes for fault injection ahead of the Chaos Day itself. The agents of chaos had their own private Slack channel to aid collaboration without revealing to other team members what havoc was soon to be unleashed.
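One simple way to whittle a long list down by risk and impact is to score each idea and keep the top few. The sketch below assumes a 1 to 5 score for likelihood and for business impact, multiplied together. The scoring scheme, the `shortlist` function and the example ideas are assumptions for illustration; the actual review was a discussion, not a formula.

```python
def shortlist(ideas, keep):
    """Return the `keep` highest-risk ideas, scored as likelihood * impact."""
    ranked = sorted(
        ideas,
        key=lambda idea: idea["likelihood"] * idea["impact"],
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical ideas with made-up scores (1 = low, 5 = high).
ideas = [
    {"name": "database instance stops", "likelihood": 2, "impact": 5},
    {"name": "pod terminated", "likelihood": 4, "impact": 2},
    {"name": "outbound proxy timeouts", "likelihood": 4, "impact": 4},
    {"name": "log shipping delayed", "likelihood": 3, "impact": 1},
]

top = shortlist(ideas, keep=2)
print([idea["name"] for idea in top])
# prints ['outbound proxy timeouts', 'database instance stops']
```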
An example of something that was developed in this period was code to randomly terminate Google Kubernetes Engine (GKE) pods. The platform runs microservices in GKE pods, which don’t have a guaranteed lifetime. The code developed was then used in a Chaos Day experiment to learn more about how microservices respond to pod termination.
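As a rough idea of what such code might look like, the sketch below picks pods at random and builds the `kubectl delete pod` command for each one, without running it. This is not the code written for the Chaos Day: the pod names, the `test` namespace and the helper functions are hypothetical, and a real tool would first list the running pods from the cluster.

```python
import random


def pick_victims(pod_names, count, rng):
    """Choose up to `count` distinct pods at random."""
    pods = sorted(pod_names)  # fixed order so a seeded run is repeatable
    return rng.sample(pods, min(count, len(pods)))


def delete_command(pod_name, namespace):
    """Build (but do not run) the kubectl command that deletes one pod."""
    return ["kubectl", "delete", "pod", pod_name, "--namespace", namespace]


# Hypothetical pod names; in practice these would come from the cluster.
pods = ["basket-7d9f", "search-5c2a", "checkout-9b1e", "search-8f4d"]

rng = random.Random(42)  # seeded so the experiment can be replayed
victims = pick_victims(pods, count=2, rng=rng)
for victim in victims:
    # Dry run: print the command instead of executing it.
    print(" ".join(delete_command(victim, "test")))
```

Keeping the selection separate from the deletion makes it easy to rehearse the experiment as a dry run before pointing it at a shared environment.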
On the day itself at John Lewis & Partners, the agents of chaos locked themselves away in a meeting room, with plenty of snacks. They worked their way through their Trello board of chaos experiments. We monitored each experiment closely, to analyse impact and team response.
In one experiment, we simulated a Google Cloud SQL failure by stopping a database instance without warning. The owning team received alerts from the platform and reacted promptly, but they took longer than expected to diagnose and resolve the failure. There were some useful learnings about the owning team’s ability to resolve issues themselves without platform team assistance.
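A gap like this, between a prompt reaction and a slow resolution, is easier to discuss when the timeline is split into stages. The sketch below shows one way to do that. The event names, the date and the timestamps are invented for illustration and are not the measurements from this experiment.

```python
from datetime import datetime


def minutes_between(start, end):
    """Elapsed time between two events, in minutes."""
    return (end - start).total_seconds() / 60


# Invented timeline for a single experiment (illustrative values only).
timeline = {
    "fault_injected": datetime(2020, 1, 1, 10, 0),
    "alert_fired": datetime(2020, 1, 1, 10, 2),
    "cause_diagnosed": datetime(2020, 1, 1, 10, 40),
    "service_restored": datetime(2020, 1, 1, 10, 55),
}

time_to_detect = minutes_between(timeline["fault_injected"], timeline["alert_fired"])
time_to_diagnose = minutes_between(timeline["alert_fired"], timeline["cause_diagnosed"])
time_to_restore = minutes_between(timeline["fault_injected"], timeline["service_restored"])

print(f"detect: {time_to_detect:.0f} min")
print(f"diagnose: {time_to_diagnose:.0f} min")
print(f"restore: {time_to_restore:.0f} min")
```

In this made-up example, detection is fast and most of the time is spent on diagnosis, which points at runbooks and access rather than at alerting.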
In another experiment, we reduced some outbound proxy timeout settings. We wanted to learn how well services handled failing requests to external dependencies. Some failures were detected by teams and responded to, but other failures went unnoticed. The agents of chaos had to provide a tip-off before the owning team became aware of a remaining problem. Some useful learnings on awareness, ownership and alerting came out of this.
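A toy model helps explain why some of these failures could go unnoticed. In the sketch below, lowering a timeout turns the slowest responses into failed requests, but the resulting error rate stays under the alert threshold. All of the numbers, and the idea of alerting on a simple error-rate threshold, are assumptions for illustration rather than the platform's real settings.

```python
def error_rate(latencies_ms, timeout_ms):
    """Fraction of requests that would exceed the timeout and fail."""
    if not latencies_ms:
        return 0.0
    failed = [latency for latency in latencies_ms if latency > timeout_ms]
    return len(failed) / len(latencies_ms)


def alert_fires(rate, threshold):
    """A simple alert rule: fire when the error rate reaches the threshold."""
    return rate >= threshold


# Made-up response times from an external dependency, in milliseconds.
latencies = [120, 180, 250, 400, 900, 1500, 2200, 300, 150, 700]

normal = error_rate(latencies, timeout_ms=3000)   # original timeout
reduced = error_rate(latencies, timeout_ms=1000)  # timeout lowered by the experiment

print(normal, alert_fires(normal, threshold=0.5))
print(reduced, alert_fires(reduced, threshold=0.5))
```

With the reduced timeout, two in ten requests fail, yet an alert set at 50% stays silent. That is the kind of gap between impact and awareness that this experiment was designed to surface.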
As John Allspaw puts it:
After the Chaos Day, we held a review with all the people impacted by, and involved in, the chaos experiments. First, we established a joint timeline of events. Then we all shared what we saw, thought, did, and didn’t do, in response to the chaos. During this process, various improvements were identified and added to the appropriate team’s backlog. These included:
- Runbook changes to improve the diagnosis of uncommon HTTP status codes.
- Alerting on database issues, to enable a faster resolution time.
- Access permission changes, to enable teams to self-serve on Cloud SQL issues.
The Chaos Day review was published using a similar format to a standard post-incident review and hosted on an internal documentation site. Now anyone working at John Lewis & Partners can learn what happened, and hopefully be inspired to run their own Chaos Day.
We also presented the key learnings from the Chaos Day at a monthly internal community meeting, where all the John Lewis & Partners delivery teams socialise over pizza to learn what’s new.
It was great fun, and really rewarding, to help John Lewis & Partners run their first Chaos Day. The John Lewis & Partners engineers have gained some valuable learnings and developed new skills around improving resilience and running Chaos Days. The organisation has also developed a greater interest in Chaos Engineering, and more teams are keen to get involved. Plans are underway for a second, wider Chaos Day, which will involve more product and platform teams.
Equal Experts would love to help other organisations, in a similar way to how we’ve helped John Lewis & Partners. If you’d like to know more about how Chaos Days can help your organisation reduce the business impact of IT failures, then please get in touch.
We’ve also published a playbook on running a Chaos Day, which you might find helpful.