The concept of Chaos Days is gaining popularity. For starters, it’s always interesting to cause chaos and break things.
But more importantly, Chaos Days provide an opportunity to improve system resilience by identifying potential weaknesses and understanding how to fix them.
However, Chaos Days always involve some risk, and as more teams adopt chaos engineering, a few common mistakes are worth discussing.
In this blog post, we’ll consider five things that can go wrong during a Chaos Day – and how to avoid each of these pitfalls. If you want to find out more about how to run a Chaos Day, don’t miss our blog post, and download our Chaos Days Playbook for detailed guidance on planning and executing your next Chaos Day.
Chaos Day Problem 1: Wrong Timing
You can’t overestimate the importance of allowing enough time to plan your Chaos Day.
Trying to run a Chaos Day with anything less than a few weeks’ notice means you’re unlikely to have enough time to pull together the right team. You need to identify the right people, ensure they are all available on the chosen date, AND confirm they have capacity to take part in Chaos Day planning.
While your chaos team is planning your experiments, don’t forget everyone else. Make sure all team members receive ‘save the date’ invitations and be sure to share required pre-reading well in advance of the Chaos Day.
Chaos Day Problem 2: Wrong People
We recommend that the most experienced engineers from each team act as agents of chaos, but your broader Chaos Day team should include other stakeholders. For example, it’s important to involve people from DevOps, engineering and the wider business community. In general, the CIO or CTO should be responsible for the Chaos Day, to ensure it will not compromise wider IT strategy. Finally, consider who in the wider business needs to be informed of, or made responsible for, changes made during the Chaos Day.
Chaos Day Problem 3: Lack of Planning
For a Chaos Day to be successful, it should in many respects be ‘just another day’ – otherwise it’s difficult to create a realistic experience for the response team.
However, many of your engineers will be unfamiliar with how chaos engineering works, so it’s helpful to make people aware that a Chaos Day may happen, and how it will operate within your organisation. Create a ‘playbook’ for Chaos Days and share it with your team well in advance of the ‘live’ Chaos Day. Encourage people to think about what experiments might happen, and how they could be approached.
Planning should also consider what you want to achieve with your Chaos Days. For example, your objectives might include building cross-team knowledge and communication, building or testing business continuity processes, or building a business case for specific technology or process investment.
Chaos Days represent a significant investment from the business in terms of time, resources and people. Given the costs involved, it’s important to consider whether Chaos Days are providing a return on investment, and how that will be measured and reported.
Chaos Day Problem 4: Running the Wrong Experiments
During the brainstorming process, it’s easy to focus on experiments that allow the engineering team to test the biggest potential failures, or the most likely. But this won’t always be the right approach.
First, when designing and selecting experiments, consider which services are the most business critical today, and how that will change in the coming months. If a system is, or will soon become, critical to overall business success, then it’s important to test its resilience during a Chaos Day, even if a less critical system might be more likely to fail.
Second, consider how big the experiments you design should be. Think of a Chaos Day experiment like a bomb blast: how wide can the blast radius be while still letting you recover quickly enough to avoid impacting the business?
Chaos engineering is still a new concept for many teams, so it is better to start small with designed failures that are easy to roll back, and help build confidence and experience in your team. Measure the impact of these experiments before attempting more ambitious, large-scale experiments.
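As an illustration of a small, easy-to-roll-back experiment, here is a minimal Python sketch. Everything in it is hypothetical: the `chaos_experiment` helper, the `service` dictionary standing in for a real system, and the `disable-cache` experiment are assumptions made for the example, not part of any particular chaos tooling. The idea it shows is simply that the rollback is registered before the failure is injected, so it runs however the experiment ends.

```python
from contextlib import contextmanager


@contextmanager
def chaos_experiment(name, inject, rollback, log):
    """Inject a designed failure, then roll it back when the block exits."""
    try:
        log.append(f"inject:{name}")
        inject()
        yield
    finally:
        # Runs whether the block finishes normally or raises an exception.
        rollback()
        log.append(f"rollback:{name}")


# Hypothetical in-memory stand-in for a real service setting.
service = {"cache_enabled": True}
events = []

with chaos_experiment(
    "disable-cache",
    inject=lambda: service.update(cache_enabled=False),
    rollback=lambda: service.update(cache_enabled=True),
    log=events,
):
    # While the failure is active, observe how the system and team respond.
    events.append(f"observed cache_enabled={service['cache_enabled']}")
```

Keeping each experiment to a single, reversible change like this is one way to limit the blast radius while your team builds confidence.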
Chaos Day Problem 5: Failing to Learn from Chaos Engineering
During the excitement of a Chaos Day, it’s important to remember to monitor and capture all the failures and team responses. That means planning carefully, in advance, which tools and metrics will be used to monitor and measure each failure and the response to it. If a system’s response time increases, will you understand how much it increased, and also why it increased?
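To make the “how much” part of that question concrete, the sketch below compares response times sampled before and during an experiment. The function name `latency_change` and the sample values are invented for illustration; in practice the numbers would come from whatever monitoring tool you already use. It does not answer the “why”, which still needs logs, traces and the team’s own observations.

```python
from statistics import median, quantiles


def latency_change(baseline_ms, experiment_ms):
    """Return the relative change in median and 95th percentile latency."""

    def p95(samples):
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        return quantiles(samples, n=20)[-1]

    def relative(before, after):
        return (after - before) / before

    return {
        "median_change": relative(median(baseline_ms), median(experiment_ms)),
        "p95_change": relative(p95(baseline_ms), p95(experiment_ms)),
    }


# Made-up samples, in milliseconds, for illustration only.
baseline = [100, 110, 120, 130, 140]
during_experiment = [150, 180, 200, 240, 400]

result = latency_change(baseline, during_experiment)
print(f"median latency changed by {result['median_change']:.0%}")
```

Agreeing on a comparison like this before the Chaos Day means everyone reads the results the same way afterwards.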
Your Chaos Day doesn’t end when the experiments end. You must be able to identify:
- What did we learn from the experiments?
- What would improve our next Chaos Day?
- Which improvement ideas should we take forward?
The second part of this learning comes from how you compile and share this information to build knowledge in your team, and identify possible improvements in systems and processes. This information needs to be shared at a Chaos Day post-mortem, which can be run individually by teams, and then shared in a team-of-teams retrospective.