Chaos Days are an opportunity to introduce disruption to your IT systems, so that you can understand how they will respond to possible ‘real’ disruptions. Of course, it’s also a highly effective way for teams to practice and improve how they respond to IT failures.
In this article, we’ll discuss some of the most common questions about Chaos Days and how they can be used to improve IT service resilience. If you’d like to find out more about how to plan, organise and run your own Chaos Day, don’t miss our Chaos Days Playbook, which you can download for free.
Q: What is a Chaos Day?
A Chaos Day is an event that runs over one or more days where teams can explore how their service responds to failures safely. During a Chaos Day, teams design and run controlled experiments in pre-production or production environments. Each experiment injects a failure into the service, such as terminating a compute instance, or filling up a storage device) and the team observes and analyses the impact and overall system response, and the response of the supporting team. Chaos Days are a practice within the field of chaos engineering.
Q: What is chaos engineering?
Chaos engineering is defined as the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos engineering teams design and run experiments that inject failures into a target system, so the team can learn how the system responds. This learning improves the resilience of the system by:
- Equipping the team with deeper understanding about system behaviour
- Informing the team on where to invest in order to improve system resilience
Q: Why do we organise Chaos Days?
Chaos Days provide a focal event for your team to practice chaos engineering. They are especially useful to teams that might be less familiar with this discipline, because they introduce chaos engineering in a structured, boundaried manner.
Chaos Days improve system resilience by helping your people learn about systems, and gain experience in how to diagnose and solve problems in high-stress situations. They provide an opportunity to improve processes such as incident management, incident analysis and engineering approaches, such as how faults should be handled and how resilience testing is performed during feature development.
Finally, Chaos Days help organisations to initiate changes that make services more resilient, improve observability and make services and dependencies better understood.
Q: How do you implement chaos engineering?
Chaos engineering is a broad and deep discipline, to which our Chaos Day playbook provides a great introduction, including a 5-minute guide to running a Chaos Day. Once you’ve digested that, the simplest next steps are to:
- Decide which part of your system you want to learn more about.
- Come up with a hypothesis for how that part responds to specific failures.
- Design and run an experiment to test that hypothesis, by injecting a failure into that part of the system. The failure injection can be manual (e.g. stop a service the system depends on) or automated (e.g. use infrastructure-as-code to remove access to the service for the duration of the experiment).
- Observe how the system responds to the failure and review as a team what was learnt from this experiment and any changes you should make as a result of it.
- Rinse and repeat.
Q: What are some top tips for Chaos Days?
- Start small, with one or two teams and a few experiments, not tens of teams and tens of experiments. This allows you to adapt and learn how to run a Chaos Day in your specific context, before scaling out to multiple teams and many experiments.
- Plan ahead – it’s possible to run a mini chaos event in a single day, but you’ll get the most from any chaos event by scheduling time in advance to design and run experiments, then reflect and share the lessons extracted from them.
- Spread knowledge by involving the whole team, but limiting how much diagnosis and repair your most experienced engineers do – either treat them as absent for that day or pair them with less experienced team members.
- Be conscious of business critical events that the chaos might impact (especially if it gets out of control). Also, allow time to return the system to its normal state. You don’t want to take down a key environment just when it’s needed for a critical release.
Q: What tools are available for running a Chaos Day? How should we run a Chaos Day if we’re running AWS?
The experiments you run during a Chaos Day typically modify system configuration or behaviour in some way that simulates a failure (e.g. shutting down a compute instance, closing a network connection). These modifications can either be done manually (e.g. through temporarily editing configuration) or in a more automated manner via tooling such as infrastructure-as-code (IaC), Chaos Monkey, AWS Fault Injection Simulator or Gremlin. If you want to repeat experiments or track them via source-control, then the tooling approach is preferable, as it codifies the experiment and automates its injection and rollback.
Q: How to set up a chaos engineering day?
That’s simple – just follow our playbook!