Chaos engineering helps your organisation answer fundamental questions about how your services perform when unexpected events occur. Does the service operate normally during a traffic surge, or when a single point of failure takes down an app or a network connection?
However, it’s critical to plan your Chaos Day so that these answers are captured during experiments, shared with your team in the immediate aftermath of the Chaos Day, and then applied to your systems and processes to make improvements and build system resilience.
The day after the Chaos Day
It’s wise to allow people 24 hours after a Chaos Day to ‘wind down’ and think about the experience. Then it’s time to hold a post-mortem style retrospective where the team can analyse what happened.
Some great resources for holding such retrospectives include:
- Jeli.io’s incident review guide
If several teams were involved in the Chaos Day, ask each team to run their own chaos retrospective, then feed their top insights into a team-of-teams retrospective.
What happens in a chaos retrospective?
Remember, the purpose of a chaos retrospective is to capture learning and identify improvements that can be applied to your systems.
We recommend involving the whole team in the chaos retrospective, not just the agents of chaos and the responders. Split the time into sections (or separate meetings) that focus on the following:
What did we learn from the experiments?
For each experiment, consider what the activity logs reveal about system behaviour. Focus on new knowledge rather than on creating a long list of improvement tasks. If ideas for improvement do come up, assign someone to review them later. Similarly, don’t get distracted by trying to ‘solve’ every issue: the retrospective is for documenting what you learned from the Chaos Day, not for solving each specific problem.
It’s important that reviews are thorough and encourage transparency, which means not blaming a failure on an individual’s actions. Instead, for each action, consider:
- What happened?
- What was the impact on system performance?
- Why did this occur?
- What can we learn from this?
- How could we prevent this failure in future?
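As an illustration only, the five questions above could be captured in a small structured record so that every reviewed action is documented in the same way. The sketch below is a minimal example under stated assumptions: the names `ActionReview`, `QUESTIONS` and `unanswered`, and the sample experiment, are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass
from typing import List

# Field name -> retrospective question, copied from the list above.
QUESTIONS = {
    "what_happened": "What happened?",
    "impact": "What was the impact on system performance?",
    "why": "Why did this occur?",
    "learning": "What can we learn from this?",
    "prevention": "How could we prevent this failure in future?",
}


@dataclass
class ActionReview:
    """Notes for one action taken during an experiment (hypothetical structure).

    There is deliberately no field for a person's name, which keeps the
    record focused on the system rather than on individuals.
    """
    experiment: str
    action: str
    what_happened: str = ""
    impact: str = ""
    why: str = ""
    learning: str = ""
    prevention: str = ""

    def unanswered(self) -> List[str]:
        """Return the questions that still have no answer recorded."""
        return [
            question
            for name, question in QUESTIONS.items()
            if not getattr(self, name).strip()
        ]


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    review = ActionReview(
        experiment="Terminate one database replica",
        action="Replica stopped by an agent of chaos",
        what_happened="Read traffic failed over to the remaining replica",
        impact="Read latency rose until the replica was restored",
    )
    for question in review.unanswered():
        print("Still to discuss:", question)
```

A record like this makes it easy to spot which questions a retrospective has not yet covered, without turning the meeting into a problem-solving session.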
How do we improve future Chaos Days?
Spend the final ten minutes of the meeting reviewing how well the Chaos Day itself went, and note down any changes that could improve future Chaos Days.
What improvement ideas should be added to the team backlog?
The most important aspect of post-Chaos Day learning is to prioritise fixing the issues found on the Chaos Day over developing new features. Ensure that senior stakeholders are committed to applying the improvements identified by the Chaos Day, and assign each improvement idea to a specific member of the team.
Share the knowledge
Chaos Days can offer important insights and learning well beyond your immediate team. However, not all stakeholders will be interested in the technical aspects of experiments and performance logs. After a retrospective meeting, capture the knowledge and share it in a form that is accessible to the wider organisation.
Make this knowledge easy for other teams to find and consume. Consider writing publicly about what you’ve done and learnt on an internal blog or platform such as Slack. Chaos engineering is still in its infancy as a software engineering discipline, and making experience reports public will help this important practice mature.