The complexity of modern, distributed IT systems means they operate on the edge of failure. We help clients manage this risk by exploring the business impact of IT failure modes, and then investing time in improving the resilience of the people, products, and processes within the system. One way we do this is by running Chaos Days, in which we deliberately inject faults into a system and explore their impact.
Equal Experts have been working with John Lewis & Partners since 2018, to help them improve their ability to deliver and operate online retail services. This includes replacing existing eCommerce applications with cloud-based microservices running on the John Lewis & Partners Digital Platform (JLDP). This bespoke platform-as-a-service enables product teams to rapidly build and run services, using Continuous Delivery and Operability practices, including You Build It, You Run It.
We recently partnered with John Lewis & Partners to run their inaugural Chaos Day, and were delighted to share our expertise. The John Lewis & Partners engineers gained valuable learnings about how revenue-impacting failures could occur and how best to mitigate them, and acquired the skills and knowledge needed to run their own Chaos Days independently.
John Lewis & Partners has 25 product teams building 40 services on their platform, which is built and run by two platform teams. Together this amounts to around 100 microservices, plus internal and external components and downstream dependencies.
At Equal Experts, our experience is that Chaos Days work best when you start small and iterate. Rather than explore the impact of chaos on all teams and services, our first Chaos Day at John Lewis & Partners had two teams participating: one platform team and one product team. A test environment was selected as the target for all experiments. To forewarn the organisation of any unintentional impact, and whet appetites for future Chaos Days, we preceded Chaos Day 1 with a presentation to John Lewis & Partners IT personnel on our approach.
Experiment selection and design
Two weeks before Chaos Day, we met with the most experienced members of the platform team, who would become the agents of chaos. The people in this role design and execute the experiments, which the rest of their team detect and respond to as if they were real production incidents. We intentionally asked for the most experienced platform engineers, to test how well incident response skills were spread across the rest of the team.
We gathered around a whiteboard, and one of the engineers sketched out their platform architecture. The group then jointly brainstormed possible experiments, considering:
- What failure mode we would learn about, e.g. loss of connectivity, an instance being terminated, network slowdown
- What impact we would expect, e.g. impact on customer transactions
- What response we would expect, e.g. the service auto-heals, an alert is fired, or nobody notices!
- How normal service would be resumed, e.g. if the resolution is unsuccessful, how would the injected fault be rolled back?
- Whether the experiment should run in isolation, e.g. would a monitoring fault limit learning from other experiments?
The agents of chaos came up with 40 ideas, which we reviewed and whittled down to 10. We did this based on risk and impact to John Lewis & Partners, i.e. which failures were most likely to happen and damage the business. Two weeks were then spent developing and testing the code and configuration changes for fault injection ahead of the Chaos Day itself. The agents of chaos had their own private Slack channel to aid collaboration without revealing to other team members what havoc was soon to be unleashed.
An example of something that was developed in this period was code to randomly terminate Google Kubernetes Engine (GKE) pods. The platform runs microservices in GKE pods, which don’t have a guaranteed lifetime. The code developed was then used in a Chaos Day experiment to learn more about how microservices respond to pod termination.
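A pod-termination tool along these lines can be sketched with the official `kubernetes` Python client. This is a minimal illustration, not the actual code built for the Chaos Day: the namespace name and the fraction of pods terminated are hypothetical values chosen for the example.

```python
# Sketch of random pod termination for a chaos experiment, assuming
# cluster access via a local kubeconfig. Deployments should recreate
# any pods deleted, which is exactly the behaviour under test.
import random


def pick_victims(pod_names, fraction, rng=random):
    """Randomly choose a fraction of pods to terminate (at least one)."""
    count = max(1, int(len(pod_names) * fraction))
    return rng.sample(pod_names, count)


def terminate_pods(namespace="checkout", fraction=0.2):
    """Delete randomly chosen pods in the namespace (names are illustrative)."""
    from kubernetes import client, config  # third-party: pip install kubernetes

    config.load_kube_config()  # authenticate using local kubeconfig
    api = client.CoreV1Api()
    pods = [p.metadata.name for p in api.list_namespaced_pod(namespace).items]
    for name in pick_victims(pods, fraction):
        api.delete_namespaced_pod(name, namespace)
        print(f"Terminated pod {name}")
```

Keeping the selection logic (`pick_victims`) separate from the cluster calls makes the randomness easy to test before pointing the tool at a real environment.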
On the day itself at John Lewis & Partners, the agents of chaos locked themselves away in a meeting room, with plenty of snacks. They worked their way through their Trello board of chaos experiments. We monitored each experiment closely, to analyse impact and team response.
In one experiment, we simulated a Google Cloud SQL failure by stopping a database instance without warning. The owning team received alerts from the platform and reacted promptly, but they took longer than expected to diagnose and resolve the failure. There were some useful learnings about the owning team’s ability to resolve issues themselves without platform team assistance.
In another experiment, we reduced some outbound proxy timeout settings. We wanted to learn how well services handled failing requests to external dependencies. Some failures were detected by teams and responded to, but other failures went unnoticed. The agents of chaos had to provide a tip-off before the owning team became aware of a remaining problem. Some useful learnings on awareness, ownership and alerting came out of this.
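One defensive pattern this kind of experiment tests is whether outbound calls have an explicit timeout and fail loudly enough for alerting to notice. A minimal Python sketch of that pattern, with illustrative function names and thresholds (not John Lewis & Partners' actual code):

```python
# Sketch: call an external dependency with a hard timeout; on failure,
# log at warning level (so log-based alerting can pick it up) and
# return a fallback rather than silently hanging or swallowing the error.
import logging
import urllib.error
import urllib.request

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("outbound")


def fetch_with_fallback(url, fallback, timeout=2.0):
    """Fetch a URL with a timeout; log and return `fallback` on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, TimeoutError) as exc:
        log.warning("outbound call to %s failed: %s", url, exc)
        return fallback
```

The key point is the explicit log line: a reduced proxy timeout then shows up in monitoring, instead of being one of the failures nobody notices.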
After the Chaos Day, we held a review with all the people impacted by, and involved in, the chaos experiments. First, we established a joint timeline of events. Then we all shared what we saw, thought, did, and didn’t do, in response to the chaos. During this process, various improvements were identified and added to the appropriate team’s backlog. These included:
- Runbook changes to improve the diagnosis of uncommon HTTP status codes.
- Alerting on database issues, to enable a faster resolution time.
- Access permission changes, to enable teams to self-serve on Cloud SQL issues.
The Chaos Day review was published using a similar format to a standard post-incident review and hosted on an internal documentation site. Now anyone working at John Lewis & Partners can learn what happened, and hopefully be inspired to run their own Chaos Day.
We also presented the key learnings from the Chaos Day at a monthly internal community meeting, where all the John Lewis & Partners delivery teams socialise over pizza to learn what’s new.
It was great fun, and really rewarding, to help John Lewis & Partners run their first Chaos Day. The John Lewis & Partners engineers have gained some valuable learnings and developed new skills around improving resilience and running Chaos Days. The organisation has also developed a greater interest in Chaos Engineering, and more teams are keen to get involved. Plans are underway for a second, wider Chaos Day, which will involve more product and platform teams.
Equal Experts would love to help other organisations, in a similar way to how we’ve helped John Lewis & Partners. If you’d like to know more about how Chaos Days can help your organisation reduce the business impact of IT failures, then please get in touch.
We’ve also published a playbook on running a Chaos Day, which you might find helpful.