Our recent Chaos Day Playbook explains why and how to run a Chaos Day, covering the key outcomes, common problems and applications.
The purpose of a Chaos Day is to deliberately induce failure in a distributed IT system, and then observe, reflect and improve on the response to that failure. In this way, organisations can build better knowledge of their IT systems and make them more resilient.
This blog post will provide a brief overview of the key steps of running a Chaos Engineering Day. They are:
- Planning (who, what, when and where)
- Execution (running your experiments)
- Review (understanding the impact, response and chaos mechanics)
- Learning (capturing and sharing knowledge)
Planning a Chaos Day
We recommend starting a Chaos Day by identifying which system elements will be tested, and which engineering teams should be involved. If this is your first Chaos Day, we advise starting small. Rather than involving the entire engineering team, choose one or two specific teams working with systems where the learning associated with a Chaos Day is most important.
Within each team that will be part of the Chaos Day, identify the most experienced engineer. This person will become the ‘agent of chaos’ who designs and runs the experiments, so it’s important they have a good working knowledge of the system and its weaknesses.
As mentioned in our “When to hold a Chaos Day” blog post, you should arrange a planning session with your whole team two weeks before the Chaos Day. As Nora Jones explains in Chaos Engineering Trap 2, it’s important to have everyone involved in brainstorming potential experiments. Once this brainstorming is complete, do the final stage of planning only with the agents of chaos, to maintain an element of surprise on the Chaos Day itself.
Your planning should consider:
- What failure mode will be involved (for example, a partial connectivity loss or a network slowdown)?
- What will be the impact of this in technical AND business terms?
- What is the anticipated response?
- If the team doesn’t resolve this failure, how will it be rolled back?
From here, shortlist 4-8 experiments that offer the most learning potential with acceptable business risk, and prepare them for your scheduled Chaos Day.
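As an illustration only, the planning questions and the shortlisting step above could be captured in a small script. This is a minimal sketch, not something taken from the Playbook: the `Experiment` fields, the 1-5 scores for `learning_value` and `business_risk`, the `max_risk` threshold and the example experiments are all hypothetical, and in practice these are judgement calls the agents of chaos would make together.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """One candidate experiment, mirroring the planning questions."""
    name: str
    failure_mode: str
    technical_impact: str
    business_impact: str
    anticipated_response: str
    rollback: str            # how the failure is undone if the team doesn't resolve it
    learning_value: int      # hypothetical 1-5 score, higher means more to learn
    business_risk: int       # hypothetical 1-5 score, higher means riskier


def shortlist(candidates, max_risk=3, limit=8):
    """Keep experiments with acceptable risk and a rollback plan,
    ranked by learning value, up to `limit` entries."""
    acceptable = [
        e for e in candidates
        if e.business_risk <= max_risk and e.rollback.strip()
    ]
    ranked = sorted(acceptable, key=lambda e: e.learning_value, reverse=True)
    return ranked[:limit]


candidates = [
    Experiment("partial connectivity loss", "block traffic to one dependency",
               "requests to that dependency fail", "some orders are delayed",
               "team spots the alert and fails over", "restore the firewall rule", 5, 2),
    Experiment("network slowdown", "inject latency between services",
               "response times rise", "pages load slowly",
               "team scales out or sheds load", "remove the latency injection", 3, 1),
    Experiment("drop primary database", "stop the primary database",
               "all writes fail", "site is unusable",
               "team promotes a replica", "restart the primary", 5, 5),
    Experiment("expire certificate", "swap in an expired certificate",
               "TLS handshakes fail", "API clients are rejected",
               "team rotates the certificate", "", 4, 2),
]

chosen = shortlist(candidates)
for experiment in chosen:
    print(experiment.name)
```

With these made-up scores, the database experiment is dropped as too risky and the certificate experiment is dropped because it has no rollback plan, leaving two candidates to prepare.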
Executing Chaos Day Experiments
Your participating teams should treat the Chaos Day as a real emergency situation, but that doesn’t mean they shouldn’t be prepared. Make sure people know when the Chaos Day will happen, what communication channel(s) to use, and whether there will be a facilitator on hand to track progress through the experiments.
We’ve found Trello boards helpful, with columns to track experiments that are in progress, those resolved by the agents of chaos, and those resolved by the owning team. Monitor each experiment closely, documenting and analysing the impact in a dedicated private channel, such as a private Slack channel.
Lastly, ensure that experiments are concluded and normal service restored before the end of the day.
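The tracking columns and the end-of-day check described above can be sketched in a few lines. This is a hypothetical, in-memory stand-in for a board, not an integration with Trello; the `Column` names and the example experiments are assumptions made for illustration.

```python
from enum import Enum


class Column(Enum):
    """Board columns, following the three states described in the post."""
    IN_PROGRESS = "in progress"
    RESOLVED_BY_AGENT = "resolved by agent of chaos"
    RESOLVED_BY_TEAM = "resolved by owning team"


def still_running(board):
    """Return the names of experiments that have not been concluded yet."""
    return sorted(
        name for name, column in board.items()
        if column is Column.IN_PROGRESS
    )


# Hypothetical state of the board late in the day.
board = {
    "partial connectivity loss": Column.RESOLVED_BY_TEAM,
    "network slowdown": Column.IN_PROGRESS,
    "disk full on worker": Column.RESOLVED_BY_AGENT,
}

pending = still_running(board)
if pending:
    print("Roll back before the end of the day:", ", ".join(pending))
else:
    print("All experiments concluded; confirm normal service is restored.")
```

A facilitator could run a check like this before closing the day, so that anything still in progress is rolled back by an agent of chaos.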
How to Review a Chaos Day
As soon as possible after the Chaos Day, hold review meetings with the team members involved in the response, the agents of chaos and anyone else relevant. Each review should walk through the experiment timeline, discussing what people observed and did.
The focus here should be on discovering new insights about the system. If ideas for improvement come up, make a note and assign someone to consider them later. It’s important not to make knee-jerk responses at this stage.
Once the reviews have been completed, identify improvements you could make to the Chaos Day itself. Make a note so that you can return to these improvements next time you plan a Chaos Day.
Sharing the knowledge gained
The last important step in running a successful Chaos Day is to ensure that lessons are learned, and shared with all relevant team members. Create and share write-ups so that stakeholders can benefit from the chaos engineering experience. This might include posting them on Slack or a wiki, or holding a presentation for the wider team.
For more detail on each of these steps and how to apply them to your own Chaos Day, don’t miss our Chaos Day Playbook, which provides detailed guidance on how to plan, execute and review a Chaos Engineering Day in your organisation.