When is the best time to hold a Chaos Day?

The complexity of modern IT systems makes it impossible to predict how they will respond to every potential failure or interruption.

Chaos engineering helps organisations explore possible adverse events by designing and running controlled experiments that involve introducing a failure, then analysing the impact and response to that failure. 

Chaos engineering experiments carry risks for any organisation – introducing an error into any IT system could have unintended consequences. This means it’s essential to plan such experiments carefully. If the experiments are to be part of a focussed event, such as a Chaos Day, make sure that:

  • You have enough time to plan the experiments, and the people needed to plan and execute them are available at the scheduled time
  • Your proposed date does not clash with major business events or planned changes to your existing IT systems
  • The existing systems are stable, and will be able to recover from introduced failures or errors

These six questions are designed to help you work out the best time to hold a Chaos Day in your organisation:

Does the business consider a Chaos Day to be important at the current time? 

Chaos Days generally provide a good return on investment for IT organisations, but in the short term the benefits might not be as obvious as those of investing in new capabilities. This can cause organisations to put off Chaos Days, or not hold them at all.

If this is the case, then we recommend starting with a smaller investment such as a time-boxed risk assessment, using a FAIR (Factor Analysis of Information Risk) approach. This type of exercise provides an opportunity to explore what failures could happen, how often they are likely to occur and the potential impact they would have. The results provide meaningful data for building a business case for Chaos Days, which might persuade stakeholders to prioritise resilience over, or alongside, new capabilities.
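As a rough illustration of the kind of numbers such an assessment can produce, the sketch below multiplies an estimated failure frequency by an estimated impact per event to give an annualised loss figure, then ranks the scenarios. The scenario names and all figures are invented for illustration, and a real FAIR analysis would normally work with ranges and distributions rather than single point estimates.

```python
from dataclasses import dataclass


@dataclass
class RiskScenario:
    name: str
    events_per_year: float  # estimated frequency of the failure (assumption)
    loss_per_event: float   # estimated cost of one occurrence (assumption)

    def annualised_loss(self) -> float:
        # Point estimate: frequency multiplied by impact.
        return self.events_per_year * self.loss_per_event


def rank_scenarios(scenarios):
    """Return scenarios ordered from highest to lowest annualised loss."""
    return sorted(scenarios, key=lambda s: s.annualised_loss(), reverse=True)


# Hypothetical figures, for illustration only.
scenarios = [
    RiskScenario("Payment provider timeout", events_per_year=4, loss_per_event=15_000),
    RiskScenario("Database failover fails", events_per_year=0.5, loss_per_event=200_000),
    RiskScenario("Disk fills on log host", events_per_year=2, loss_per_event=5_000),
]

for scenario in rank_scenarios(scenarios):
    print(f"{scenario.name}: {scenario.annualised_loss():,.0f} per year")
```

A ranked list like this can help decide which failures are worth turning into Chaos Day experiments first.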

Do we have enough time to plan, execute and review a Chaos Day?

It’s vital that Chaos Days are carefully scheduled to ensure that the organisation has the time and resources available to fully engage with the experiment. The timeline for a Chaos Day must allow sufficient time for planning. We recommend that you arrange a planning session with the whole team at least two weeks before the Chaos Day.

Ideally, the Chaos Day should be scheduled for a date that won’t impact key business events. For example, if experiments could leave the target environment severely degraded, check that this won’t delay any production releases that need to pass through it around that date. You will also need to schedule review meetings with teams after the Chaos Day, to capture learning and make recommendations.

You’ll need to schedule time to discuss and plan experiments, and decide when they should be run. You’ll need a couple of weeks after planning to allow engineering teams enough time to design, implement and test the experiments. Don’t forget to check whether other teams have any key dependencies on the target environment and services over the Chaos Day, so that your experiments don’t cause problems for them.
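The date arithmetic behind this timeline can be sketched in a few lines. The example below works back from a proposed Chaos Day to the latest planning session (two weeks earlier, as recommended above) and flags business events that fall close to the date. The function names, the one-week review gap, the three-day buffer and the dates are all assumptions made for illustration.

```python
from datetime import date, timedelta


def chaos_day_timeline(chaos_day, build_weeks=2, review_days_after=7):
    """Work backwards and forwards from the Chaos Day to the other key dates."""
    return {
        # Latest date for the whole-team planning session.
        "planning_session": chaos_day - timedelta(weeks=build_weeks),
        "chaos_day": chaos_day,
        # The review gap is an assumption; adjust it to suit your teams.
        "review_meeting": chaos_day + timedelta(days=review_days_after),
    }


def nearby_events(chaos_day, events, buffer_days=3):
    """Return names of business events within buffer_days of the Chaos Day."""
    return [
        name
        for name, when in events.items()
        if abs((when - chaos_day).days) <= buffer_days
    ]


# Hypothetical dates, for illustration only.
proposed = date(2025, 3, 20)
events = {
    "Quarterly release": date(2025, 3, 21),
    "Seasonal sale": date(2025, 5, 1),
}

timeline = chaos_day_timeline(proposed)
for label, when in timeline.items():
    print(f"{label}: {when.isoformat()}")
print("Possible clashes:", nearby_events(proposed, events))
```

If the clash list is not empty, it is usually simpler to move the Chaos Day than to move the business event.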

Do we have any major changes planned? 

In general, the best time to schedule a Chaos Day is far enough from business changes to avoid disruption, but close enough to them that the findings are still relevant and there is time to resolve any problems that could affect those changes. For example, an online retailer could hold a Chaos Day two months before a seasonal peak in traffic, while a manufacturer could organise a Chaos Day to test its supply chain management before implementing a new shipping process.

It’s important to allow time between the Chaos Day and the business change, so that learning can be distilled and improvements applied. At one of our clients with a very large platform (1,000+ microservices processing 1 billion requests on peak days), we found that 2-3 Chaos Days per year was ideal. 

Is our system currently stable?

Chaos Days offer important benefits, but if your systems are currently unstable and there are regular incidents and failures, you might already have enough chaos to contend with! In this instance, we advise focusing on improving post-incident reviews to bring about system stability. Once you’ve had a few months free of issues, then you could try a small Chaos Day in pre-production to test that stability. 

Are the right people available to support our Chaos Day?

To get the best results from a Chaos Day, you’ll need experienced engineers from the relevant teams to design and execute experiments. Ensure that these people are not committed to other tasks at the time you will need them for your Chaos Day.

How much time do we need for a Chaos Day? 

To identify the best time to run a Chaos Day, first decide whether you need just a few hours, a whole day, or whether the event should be spread over several days. 

Running a Chaos Day over a single day results in a more intensive but shorter event, which can add stress. Our experience is that this intensity also improves team dynamics and leads to a more memorable event. The disadvantage of a single Chaos Day is that it can be hard to maintain an element of surprise, especially in a pre-production environment. Teams need to be informed to treat failures in this environment as though production were on fire, so it’s likely they’ll be paying close attention. 

Spreading a Chaos Day over a number of days makes it easier to spring surprises on the team, because people won’t know exactly when experiments will be run during the chosen period. It also allows for adjustments to be made to experiments, using learning from early experiments to improve later ones.
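One simple way to keep start times unpredictable over a multi-day window is to draw them at random. The sketch below assigns each experiment a random start time between two datetimes. The experiment names and the window are hypothetical, and the function does not restrict times to working hours or keep experiments apart, both of which a real schedule would probably need.

```python
import random
from datetime import datetime, timedelta


def schedule_experiments(names, window_start, window_end, seed=None):
    """Assign each experiment a random start time within the window."""
    rng = random.Random(seed)  # pass a seed to make the schedule reproducible
    window_seconds = int((window_end - window_start).total_seconds())
    if window_seconds <= 0:
        raise ValueError("window_end must be after window_start")
    schedule = [
        (window_start + timedelta(seconds=rng.randrange(window_seconds)), name)
        for name in names
    ]
    return sorted(schedule)  # earliest experiment first


# Hypothetical experiments and window, for illustration only.
experiments = ["terminate compute instance", "fill storage device"]
start = datetime(2025, 3, 18, 9, 0)
end = datetime(2025, 3, 20, 17, 0)

for when, name in schedule_experiments(experiments, start, end):
    print(f"{when:%a %d %b %H:%M}  {name}")
```

Only the people running the experiments should see the resulting schedule; the responding teams just need to know the overall window.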

Download our Chaos Day playbook in pdf if you prefer

For more insight into the benefits of running a Chaos Day, along with expert guidance on how and when to organise Chaos Days for the maximum benefit, check out the playbook online here.

 

For technology professionals, stability is the gold standard. But once in a while, it is important to create chaos.

A Chaos Day is an opportunity to carry out carefully planned experiments that introduce errors and turbulent conditions in IT systems, such as terminating a compute instance or filling up a storage device. It’s a useful exercise for any organisation that wants to understand the impact and system response, and then use that understanding to improve reliability and resilience.

Wondering whether you could benefit from a Chaos Day? Here are six reasons why your organisation should run a Chaos Day this year: 

1: To prepare for the unexpected  

The main benefit of a Chaos Day is that it helps the organisation prepare for inevitable failures and unexpected events before they actually occur. In today’s complex IT environments, turbulence will happen due to single-point failures or multiple, unrelated failures – often combined with sudden changes in external pressure, such as traffic spikes or security threats.

2: To analyse how your team prepares for and responds to problems

Carrying out a Chaos Day allows you to view, analyse and improve how your team responds to unexpected, turbulent conditions. It can provide a safe way to identify gaps in your teams’ skills around collaborating, communicating and thinking during high-stress periods. 

3: To improve skills and knowledge across your IT team 

During a Chaos Day, you can expect your team to gain: 

  • New knowledge about system behaviour 
  • Expertise in diagnosing and resolving incidents 
  • Better skills around collaboration and communication 
  • Greater understanding of system failures and recovery 

Teams will also share knowledge while working through problems on a Chaos Day, meaning each team has a better understanding of their colleagues’ knowledge and skills. Chaos Days can also improve technical knowledge, which can be used to make changes that boost resilience. For example, a Chaos Day can illustrate the usefulness of new features such as retry mechanisms and circuit breakers.
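For readers unfamiliar with those two patterns, here is a minimal sketch of each. The class and function names, thresholds and delays are illustrative assumptions rather than any particular library’s API, and production code would usually rely on an established resilience library instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests need not wait
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A retry helps with brief, transient failures, while a circuit breaker protects a struggling dependency from being hammered by repeated calls. A Chaos Day experiment that degrades a downstream service is a practical way to see whether either behaves as intended.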

4: To build resilience 

The ultimate goal of a Chaos Day is to build resilience. Teams gain a greater understanding of system behaviour and failure scenarios, which they can draw on when tackling production incidents or developing system enhancements.

Chaos Days improve system resilience by improving: 

  • The skills, knowledge and understanding of your team 
  • Processes, by guiding improvements in incident management, analysis and engineering 
  • Products, by initiating changes that make services more resilient, and by improving documentation such as error messages and runbooks
