We spend a lot of our time thinking about how best to shield our clients from the potential pitfalls of digital business. We ask ourselves “what’s the worst that could happen?” – and work hard to mitigate the risk.
As part of these efforts, we recently ran a ‘Chaos Day’ with one of our clients – a major Government department that hosts around 50 digital delivery teams, distributed all around the UK. These teams design, deliver and support hundreds of microservices that serve online content to the department’s varied customers.
The microservices all run on a single platform, itself run by seven Platform Teams that take responsibility for distinct areas (infrastructure, security and so on). Equal Experts collaborates on the ongoing development of the Platform, with an array of infrastructure engineers, developers, testers and delivery leads spread across these teams.
Inspired by Netflix’s Chaos Monkey and Amazon’s Game Day, the Platform Teams planned and executed their own Chaos Day – to see just how well they and the Platform coped when everything that could go wrong, does go wrong.
Why run such an event now? Well, since moving to Amazon Web Services (AWS), the Platform has benefited from improved stability and performance. We followed AWS best practices, using Auto Scaling Groups (ASGs) and multiple Availability Zones (AZs) to improve the availability of the Platform.
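To illustrate why spreading an ASG across AZs matters for an exercise like this, here’s a toy model (not the Platform’s real configuration – instance IDs and zone names are invented) of what happens to capacity when a single AZ fails:

```python
# Toy model: instances an Auto Scaling Group has spread across
# Availability Zones (IDs and zone names are illustrative only).
instances = [
    {"id": "i-01", "az": "eu-west-2a"},
    {"id": "i-02", "az": "eu-west-2b"},
    {"id": "i-03", "az": "eu-west-2c"},
    {"id": "i-04", "az": "eu-west-2a"},
]

def survivors(instances, failed_az):
    """Instances still able to serve traffic after one AZ goes down."""
    return [i for i in instances if i["az"] != failed_az]

# Losing one of three AZs still leaves instances in the other two,
# and the ASG would then launch replacements in the healthy zones.
remaining = survivors(instances, "eu-west-2a")
```

A single-AZ deployment, by contrast, would have nothing left to serve traffic – which is exactly the kind of scenario a Chaos Day should exercise.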
Another factor was that the client’s digital service teams are now well versed in dealing with traffic peaks. So we felt the time was right to put the resilience of the Platform and its teams to the test – and make sure no laurels were being sat on.
We aimed to test the impact of a perfect storm of things going wrong during our client’s busiest time of year. The chaos ranged from key engineers being hit by a bus, to half of our AWS infrastructure going down (all simulated: no animals, production systems, or people were harmed in the course of the event).
The event was fun, frantic and a great learning experience. As with any event of this kind, we learned a few things you might want to consider if you’re planning a similar exercise:
Coordinated chaos sounds like an oxymoron, but to get the most from our Chaos Day, some planning was required. To keep the precise nature of our chaos a surprise for the teams facing it, our planning session was split as follows:
- An initial open session, to define the mechanics of Chaos Day – such as which environment would be used, who would participate, which Slack channels to use, and what was expected from the Platform and digital service teams.
- A second session, limited to our Agents of Chaos. We chose highly experienced, knowledgeable members from each Platform Team to fill these roles, i.e. the people you really wouldn’t want to be hit by a bus (but that wouldn’t really happen … would it?)
We ran the Chaos Day on our Staging environment, while running peak load tests in the background. The Chaos team was kept separate from the Platform Teams, and injected chaos in secret throughout the day. The Platform Teams treated issues as though they were real Production ones.
Over twenty chaotic disruptions were injected across the day, including failures of microservices, instances, deployment tools, availability zones, and team members (i.e. made unavailable by a pretend bus).
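For illustration, here’s a minimal sketch of how an Agent of Chaos might draw the day’s disruptions from a prepared catalogue. The catalogue and names are hypothetical, not the team’s actual tooling – they simply mirror the kinds of failures described above:

```python
import random

# Hypothetical catalogue of chaos experiments, mirroring the kinds of
# disruptions injected on the day (names are illustrative only).
EXPERIMENTS = [
    {"name": "kill-microservice", "target": "a randomly chosen microservice"},
    {"name": "terminate-instances", "target": "instances in one Auto Scaling Group"},
    {"name": "break-deploy-tool", "target": "the deployment pipeline"},
    {"name": "bus-factor", "target": "one engineer, asked to go silent on Slack"},
    {"name": "az-outage", "target": "everything in a single Availability Zone"},
]

def pick_experiments(count, rng=random):
    """Choose `count` distinct experiments for the Agents of Chaos to run."""
    return rng.sample(EXPERIMENTS, count)
```

Keeping the catalogue written down in advance is what makes the chaos “coordinated”: the injections surprise the Platform Teams, but the Agents of Chaos always know exactly what they broke and how to roll it back.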
The Platform Teams put up a valiant fight against the onslaught – especially as real Production issues occurred on the same day (with a certain sense of inevitability).
We followed up with a retrospective to identify what went well, what we needed to address and how we’d improve future Chaos Days. The improvements we identified included:
- Run it more regularly (at least quarterly);
- Run it further ahead of the next annual peak;
- Widen the scope of the teams we involve (e.g. we have an API team, who weren’t included this time);
- (Stretch goal) Run it in Production!
Even so, the day was a real success. It provided tangible confirmation that the platform is performant and resilient, and that the team is able to cope with a wide range of failures that might occur.
Bugs are inevitable in any complex system, though, and the day would have been a failure if none had been found – so it was reassuring that twenty-two issues were identified as a result of the exercise.
On a personal note, I was greatly impressed by the passion, professionalism and expertise with which the Chaos Day was conducted, both in terms of the Platform Teams and the Agents of Chaos. It was a real privilege to be (a tiny) part of the day, and I can heartily recommend running a similar event.