How to capture learning from a Chaos Day
Chaos engineering helps your organisation to answer fundamental questions about how your services perform when unexpected events occur. Does the service operate normally during traffic surges? What happens if a single point of failure crashes an app or drops a network connection?
However, it’s critical to plan your Chaos Day so that these answers are captured during experiments, shared with your team in the immediate aftermath of the Chaos Day, and applied to your systems and processes to make improvements and build system resilience.
The day after the Chaos Day
It’s wise to allow people 24 hours after a Chaos Day to ‘wind down’ and think about the experience. Then it’s time to hold a post-mortem style retrospective where the team can analyse what happened.
Some great resources for holding such retrospectives include:
- Jeli.io’s incident review guide
If several teams were involved in the Chaos Day, ask each team to run their own Chaos retrospective, then feed their top insights into a team-of-teams retrospective.
What happens in a chaos retrospective?
Remember, the purpose of a chaos retrospective is to capture learning and identify improvements that can be applied to your systems.
We recommend involving the whole team in the Chaos retrospective, not just agents of chaos and responders. Split the time into sections (or separate meetings) that focus on the following:
What did we learn from the experiments?
For each experiment, consider what the activity logs reveal about system behaviour, focusing on new knowledge, rather than just creating a long list of improvement tasks. If ideas for improvement do come up, assign someone to review these ideas later. Similarly, don’t get distracted by trying to ‘solve’ every issue, because the focus of the retrospective is on documenting learning from the chaos day, not solving each specific problem.
It’s important to create reviews that are thorough and encourage transparency, which means not blaming a failure on any individual’s actions. Instead, for each action, consider:
- What happened?
- What was the impact on system performance?
- Why did this occur?
- What can we learn from this?
- How could we prevent this failure in future?
How do we improve future Chaos Days?
Spend the final ten minutes of the meeting noting the success of the Chaos Day itself. Note down what changes you could make to improve future Chaos Days.
What improvement considerations/ideas should be added to the team backlog?
The most important aspect of post-Chaos Day learning is to prioritise fixing the findings of the Chaos Day over developing new features. Ensure that senior stakeholders are committed to applying improvements identified by the Chaos Day, and assign each improvement idea to a specific member of the team.
Share the knowledge
Chaos Days can offer important insights and learning well beyond your immediate team. However, it’s important to note that not all stakeholders will be interested in the technical aspects of experiments and performance logs. After a retrospective meeting, capture the knowledge and share it in a form that is accessible to the wider organisation.
You should make this knowledge easy for other teams to find and consume. Consider writing publicly about what you’ve done and learnt on an internal blog or platform such as Slack. Chaos engineering is still in its infancy in the software engineering discipline and making experience reports public will help this important practice mature.
Chaos Days are a great way to learn about and improve an IT system’s resilience, by deliberately invoking failures in a controlled manner and then observing their impact.
This article describes a set of practices for identifying which failures to explore. These practices have origins in process engineering, but are equally valid in the IT domain and can be used at any point in the development lifecycle.
How does this help?
- Predicting error conditions can help design mitigations. In most IT systems this is usually about improving the user experience of a failure, but in safety critical applications it can extend to avoiding injury or even death!
- Improving knowledge of a system’s behaviour can inform where to get most value from automated and manual testing.
- Knowing which failures are predictable can be used to improve monitoring and alerting.
- It produces a list of expected behaviour under failure conditions that can help prioritise which failures can be safely explored during a Chaos Day, and which should be avoided due to their impact being too severe.
Fault Analysis Techniques
The four techniques are:
- Functional Failure Analysis
- Failure Mode and Effects Analysis (FMEA)
- Fault Tree Analysis
- Hazard and Operability Study (HAZOPs, for once an “x-Ops” that isn’t just adding operations!)
Each of these is a team exercise involving stakeholders and subject matter experts (and a formal HAZOPs requires particular participants to meet regulatory requirements). Whilst in the process engineering world we would create some documentation about the safety of a chemical plant (for example), it’s the process of thinking about the system from various viewpoints that is of most interest to me.
Let’s take a brief look at each one.
Functional Failure Analysis
This can be done very early on, without even an architecture defined, because what we are considering is the effect on the system of a failure of one or more of the functions of that system.
For example, when thinking about a car as a system, you could consider the functions of “steering”, “braking”, “acceleration” etc. all without any idea of how those functions are implemented. Likewise, a retail website would have functions of “displaying goods”, “searching for products”, “placing orders” and so on – relatively easy to enumerate, whilst the design is no more than a handful of boxes and lines.
So the failure analysis part is to ask of each function – how could this function fail, and what are the consequences of that failure? As a prompt for this question we need to ensure that we consider a) loss of function, b) function provided when not required, and c) incorrect function.
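As a rough illustration, the three prompts can be crossed with a list of functions to produce a blank worksheet for the team to fill in together. This is a minimal sketch: the function names come from the retail example above, and the worksheet layout and the `build_worksheet` helper are assumptions, not part of any standard.

```python
from itertools import product

# Functions of a retail website, taken from the example in the text.
FUNCTIONS = ["displaying goods", "searching for products", "placing orders"]

# The three failure prompts named in the text.
FAILURE_PROMPTS = [
    "loss of function",
    "function provided when not required",
    "incorrect function",
]


def build_worksheet(functions, prompts):
    """Return one blank worksheet row per (function, prompt) pair."""
    return [
        {"function": f, "failure": p, "consequence": None}
        for f, p in product(functions, prompts)
    ]


worksheet = build_worksheet(FUNCTIONS, FAILURE_PROMPTS)
for row in worksheet:
    print(f"{row['function']}: {row['failure']} -> consequence?")
```

Each row is a question for the group, and the "consequence" column is filled in during the discussion rather than by the script.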
This practice has largely been replaced by FMEA (and is sometimes referred to as a functional FMEA) but is still a useful technique for examining the system in a generic fashion because we can apply it early in the lifecycle, or with a group of stakeholders who don’t know or care about the system detail.
Failure Mode and Effects Analysis
In order to do this analysis, we need to have the architecture defined – at least to the point where we have a separation of the components of the system. It can be done at different levels of detail, and the more detailed the examination the finer grained the effects can be.
Consider the system as a combination of components, then take each component and detail what the effect would be on the system if there were a failure of that component. In this case, we have some detail about the component so we can consider the realistic ways in which it can fail.
For instance, in a car, if the power steering component fails – the steering overall does not fail but it can be more difficult for the driver to steer, so we should indicate a power steering failure to them. Whereas, if the steering arm broke on one side then the steering would be significantly impaired; in this case a regular MOT checks the condition of the part.
In a retail website made of microservices, if the payment service failed then you wouldn’t be able to take any orders; if it was the address lookup service that failed, it would minorly inconvenience customers and potentially lead to misaddressed orders.
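The output of this analysis is often kept as a simple table of components, failures and effects. The sketch below records the two retail examples above and sorts them so the most severe effects are discussed first. The `FailureMode` structure and the 1 to 5 severity scale are assumptions made for illustration, and the severity values are invented.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    failure: str
    effect: str
    severity: int  # assumed scale: 1 (minor) to 5 (critical)


# Rows based on the retail example in the text; severities are illustrative.
fmea = [
    FailureMode("address lookup service", "unavailable",
                "minor inconvenience; possible misaddressed orders", 2),
    FailureMode("payment service", "unavailable",
                "no orders can be taken", 5),
]


def by_severity(rows):
    """Sort failure modes so the most severe effects are reviewed first."""
    return sorted(rows, key=lambda r: r.severity, reverse=True)


for row in by_severity(fmea):
    print(f"[{row.severity}] {row.component} {row.failure}: {row.effect}")
```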
Fault Tree Analysis
In fault tree analysis we flip the whole thing upside-down and start instead with the possible externally detectable ways the system can fail. Then, for each fault, we follow a process a bit like the “5 whys” – but instead of asking “why did this happen?” we ask “what could cause this?”. We aren’t looking for a root cause, but to build a tree of things that could happen in the system that could contribute to a fault.
Importantly, some of those things will be systems doing what they should be doing but in combination with something else not working properly.
This is not a simple tree of causes: each node is linked to its branches by a logical condition (such as AND or OR). The tree can then be used to identify points where it can be disrupted, so that a particular combination of conditions does not result in a fault.
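To make the idea of logical conditions concrete, here is a minimal sketch of a fault tree whose nodes combine their branches with AND or OR. The `Node` class, the event names and the example tree are all hypothetical; real fault tree tools offer far more (probabilities, minimal cut sets and so on).

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    gate: str = "EVENT"  # "EVENT" (leaf), "AND" or "OR"
    children: List["Node"] = field(default_factory=list)

    def occurs(self, active: Set[str]) -> bool:
        """True if this node occurs given the set of active basic events."""
        if self.gate == "EVENT":
            return self.name in active
        results = [child.occurs(active) for child in self.children]
        return all(results) if self.gate == "AND" else any(results)


# Hypothetical tree: checkout fails if the payment service is down, OR if
# the primary database is down AND failover is broken.
tree = Node("checkout fails", "OR", [
    Node("payment service down"),
    Node("database unavailable", "AND", [
        Node("primary db down"),
        Node("failover broken"),
    ]),
])

print(tree.occurs({"primary db down"}))
print(tree.occurs({"primary db down", "failover broken"}))
```

In this example, a working failover "disrupts" the tree: the primary database going down on its own does not produce the top-level fault.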
Hazard and Operability Study
Last, but certainly not least, is the HAZOP process.
There have been various attempts to codify this to apply to software specifically – but those have been formal safety based regulatory-compliance drives. If the process is considered more for the value of the conversation it stimulates, then it’s easier to apply it.
In simple terms, the HAZOP process examines the flow of material through the system (clearly in process engineering, that’s the real flow of chemicals etc. – in software it would be data). At each point where chemicals are used or data is processed, we use a bunch of keywords to guide our thinking on what might happen:
- “No or not”
- “Other than”
- “As well as”
- “Part of”
- “Reverse (of intent)”
It’s fairly easy to see what each means in process engineering – but each can be readily applied to data flows as well.
If you think of an API – what happens when the client receives data “other than” what it was expecting? Timing-related keywords such as “late” or “after” work too: what if an asynchronous process receives data late, or after some other data? Each keyword can be applied with a bit of creativity to software data flows.
In each case you would consider the causes of the condition, the consequences for the system and the action that might need to be taken.
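One way to apply the keywords to a data flow is to turn each into a deliberately deviated message and ask what the receiver would do with it. The sketch below takes four of the keywords listed above. How each keyword maps to a payload change is an assumption made for illustration, as are the `deviations` helper and the example order.

```python
def deviations(payload: dict) -> dict:
    """Return one deviated payload per keyword (interpretations are assumptions)."""
    first_key = next(iter(payload))
    return {
        # "No or not": the message never arrives.
        "no or not": None,
        # "Part of": one field is missing.
        "part of": {k: v for k, v in payload.items() if k != first_key},
        # "As well as": an extra, unexpected field is present.
        "as well as": {**payload, "unexpected_field": "surprise"},
        # "Other than": values arrive with the wrong type (all strings).
        "other than": {k: str(v) for k, v in payload.items()},
    }


order = {"order_id": 42, "quantity": 3}
for word, variant in deviations(order).items():
    print(f"{word!r}: {variant}")
```

Each variant is a prompt for the discussion of causes, consequences and actions, and can later become a test case or a chaos experiment.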
Whilst I’ve only really scratched the surface of these techniques, I hope I’ve highlighted some of the advantages of examining your system from the variety of perspectives they encourage, and illustrated how you can use them.
Find out more
Many of the texts around these topics are process engineering focused, some are 40 years old (though still relevant), and many are expensive (three figures).
Some good ones are suggested reading on the course modules of the Safety Critical Masters degree course at the University of York.
For more details of the techniques without delving too deep into your expense budgets, the Wikipedia pages for each technique are fairly detailed.
Finally, our open-source Chaos Day playbook provides lots of useful guidance on why, how and when to run a Chaos Day.
Despite the name, planning a Chaos Day should help you deliver a carefully developed series of experiments that test for specific weaknesses within an application or system. Here, we run through the key steps in planning your next chaos engineering game day.
Before you start
There are a number of questions that it’s important to ask before you start planning the specifics of a Chaos Day. Without a clear goal in mind, you could just end up causing chaos that doesn’t deliver any learning or system improvements.
So ask yourself – who do I need to attend my Chaos Day? What systems and processes do I want to focus on? How will we measure success? Do we have a specific budget? Where will the Chaos Day take place?
Whether you’re experimenting on a single service or at scale on an entire digital platform, planning your Chaos Day is essential to make the most of your investment of time and energy. While the process is similar regardless of scale, the organisational complexity, commitment and elapsed time increase with the number of services and teams involved. Because of this, our advice is to start small, so you can learn and adapt the process to your particular situation. Start with one service or team, not an entire engineering platform, then grow incrementally with each subsequent Chaos Day.
The risks involved with chaos also increase with the number of systems and teams involved. For this reason, we advise building in small, time-boxed system risk assessments using a framework like FAIR. This will give you the chance to explore what potential failures might happen, their frequency and the magnitude of their impact.
Identify target teams
Running a Chaos Day requires people’s time and system usage, so it needs to be carefully scheduled. Remember that the benefits of a Chaos Day might not seem as compelling to engineers who are working on new features, so it’s important to let people know the Chaos Day is happening, and to communicate the value and outcomes of the Chaos Day.
Several weeks before your Chaos Day is scheduled, start to plan your Chaos Day team and ask people to ‘save the date’ for the project. Check out our guidance on when is the right time to schedule a Chaos Day in this post. Make sure you have an appropriate venue secured for your Chaos Day.
Your agents of chaos are people who will carry out experiments. Gathering this team together and brainstorming ‘what if’ scenarios helps you to generate ideas for candidate experiments.
Identify target services and malfunctions
At the same time as planning your Chaos Day team, you need to think about what experiments you’ll run, and what you’re testing for. You might have a specific system that is about to experience a peak in traffic that needs to be tested for resilience. Or you might be considering a new process or integration that needs to be tested for potential bugs or unexpected events.
Understanding what systems are to be tested and why is important in helping senior stakeholders understand the value and purpose of the Chaos Day.
Plan and design experiments
Start by agreeing on benchmarking data for what is normal system performance, and then plan experiments that will allow you to monitor performance when the system is put under strain. Typically, chaos experiments introduce variables that reflect events such as a server crash, network failure or hard drive malfunction.
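As a sketch of what agreeing a benchmark might look like in practice, the snippet below summarises response times collected before and during an experiment and flags when the 95th percentile drifts beyond an agreed tolerance. The sample figures, the 20% tolerance and the helper names are all assumptions. In a real setup these numbers would come from your monitoring tooling.

```python
from statistics import mean, quantiles


def summarise(latencies_ms):
    """Return the mean and 95th percentile of a list of latency samples."""
    p95 = quantiles(latencies_ms, n=20)[-1]  # last of 19 cut points = 95th percentile
    return {"mean": mean(latencies_ms), "p95": p95}


def breaches_baseline(baseline, observed, tolerance=0.2):
    """True if the observed p95 exceeds the baseline p95 by more than `tolerance`."""
    return observed["p95"] > baseline["p95"] * (1 + tolerance)


# Illustrative samples in milliseconds (invented for this sketch).
baseline = summarise([100, 110, 105, 98, 102, 107, 101, 99, 104, 103])
during = summarise([100, 300, 280, 120, 310, 290, 105, 295, 305, 285])

print(breaches_baseline(baseline, during))
```

Agreeing the tolerance before the Chaos Day avoids debating on the day whether a change in performance counts as a finding.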
We recommend allowing around 2-4 weeks for experiment design, making sure to check dates and environment with stakeholders. Be sure to communicate with stakeholders about what, why, where, when and how experiments will happen. For more guidance on what experiments to run on a Chaos Day, see our Playbook.
Communicate at all times
Throughout your planning process, ensure that stakeholders are aware of what is being planned, and the potential impact on business as usual.
Ahead of your Chaos Day, you may also need to distribute pre-event materials to your chosen agents of chaos, or an agenda for the day, so they are fully prepared.
If you’d like to find out more about how to plan, organise and run your own chaos day, don’t miss our Chaos Days Playbook.
The concept of Chaos Days is gaining popularity. For starters, it’s always interesting to cause chaos and break things.
But more importantly, Chaos Days provide an opportunity to improve system resilience by identifying potential weaknesses and understanding how to fix them.
But Chaos Days always involve some risk, and as more teams start using chaos engineering, there are a few common mistakes that are worth discussing.
In this blog post, we’ll consider five things that can go wrong during a Chaos Day – and how to ensure you avoid all of these pitfalls. If you want to find out more about how to run a Chaos Day, don’t miss our blog post, and download our Chaos Days Playbook for detailed guidance on planning and executing your next Chaos Day.
Chaos Day Problem 1: Wrong Timing
You can’t overestimate the importance of allowing enough time to plan your Chaos Day.
Trying to run a Chaos Day with anything less than a few weeks’ notice means you’re unlikely to have enough time to pull together the right team members. You need to be able to identify the right team members, ensure they are all available on the chosen date, AND have capacity to take part in Chaos Day planning.
While your chaos team is planning your experiments, don’t forget everyone else. Make sure all team members receive ‘save the date’ invitations and be sure to share required pre-reading well in advance of the Chaos Day.
Chaos Day Problem 2: Wrong People
We recommend asking the most experienced engineers from each team to act as agents of chaos, but your broader Chaos Day team should include other stakeholders. For example, it’s important to involve people from DevOps, engineering and the wider business community. In general, the CIO or CTO should be responsible for the Chaos Day, to ensure it will not compromise wider IT strategy. Finally, it’s important to consider who, from the wider business, needs to be informed about, or take responsibility for, changes made during the Chaos Day.
Chaos Day Problem 3: Lack of Planning
For a Chaos Day to be successful, it should in many respects be ‘just another day’ – otherwise it’s difficult to create a realistic experience for the response team.
However, many of your engineers will be unfamiliar with how chaos engineering works, and so it’s helpful to ensure that people are aware of the possibility of a Chaos Day, and how it will operate within your organisation. Create a ‘playbook’ for Chaos Days and share it with your team well in advance of the ‘live’ Chaos Day. Encourage people to think about what experiments might happen, and how they could be approached.
Planning should also consider what you want to achieve with your Chaos Days. For example, your objectives might include building cross-team knowledge and communication, building or testing business continuity processes, or building a business case for specific technology or process investment.
Chaos Days represent a significant investment from the business in terms of time, resources and people. Given the costs involved, it’s important to consider whether Chaos Days are providing a return on investment, and how that will be measured and reported.
Chaos Day Problem 4: Running the wrong experiments
During the brainstorming process, it’s easy to focus on experiments that allow the engineering team to test the biggest potential failures, or the most likely. But this won’t always be the right approach.
When designing and selecting experiments, consider which are the most business critical services, today, and how that will change in the coming months. If a system is or will soon become critical to overall business success, then it’s important to test its resilience during a Chaos Day, even if a less critical system might be more likely to fail.
Second, consider how big the experiments you design should be. Think of Chaos Days like a bomb blast – how wide should the radius be, to ensure you can recover quickly enough to avoid impacting the business?
Chaos engineering is still a new concept for many teams, so it is better to start small with designed failures that are easy to roll back, and help build confidence and experience in your team. Measure the impact of these experiments before attempting more ambitious, large-scale experiments.
Chaos Day Problem 5: Failing to Learn from Chaos Engineering
During the excitement of a Chaos Day, it’s important to remember to monitor, capture and observe all the failures and team responses. For starters then, you need to plan carefully what tools and metrics will be used to monitor and measure the failure and its response. If a system response time increases, will you understand how much it increased, and also why it increased?
Your Chaos Day doesn’t end when the experiments end. You must be able to identify:
- What did we learn from the experiments?
- What would improve our next Chaos Day?
- What improvement considerations/ideas should we progress?
The second part of this learning comes from how you compile and share this information to build knowledge in your team, and identify possible improvements in systems and processes. This information needs to be shared at a Chaos Day post-mortem, which can be run individually by teams, and then shared in a team-of-teams retrospective.
Our recent Chaos Day Playbook explains why and how to run a Chaos Day, covering the key outcomes, common problems and applications.
The purpose of a Chaos Day is to deliberately induce failure in a distributed IT system, and then observe, reflect and improve on the response to failure. In this way, organisations can build better knowledge about and resilience in IT systems.
This blog post will provide a brief overview of the key steps of running a Chaos Engineering Day. They are:
- Planning (who, what, when and where)
- Execution (running your experiments)
- Review (understanding the impact, response and chaos mechanics)
- Learning (capturing and sharing knowledge)
Planning a Chaos Day
We recommend starting a Chaos Day by identifying which system elements will be tested, and which engineering teams should be involved. If this is your first Chaos Day, we advise starting small. Rather than involving the entire engineering team, choose one or two specific teams working with systems where the learning associated with a Chaos Day is most important.
Within each team that will be part of the Chaos Day, identify the most experienced engineer. This person will become the ‘agent of chaos’. This person will design and run the experiment, and so it’s important they have a good working knowledge of the system and its weaknesses.
As mentioned in our “When to hold a Chaos Day” blog post, you should arrange a planning session with your whole team two weeks before the Chaos Day. As Nora Jones explains in Chaos Engineering Trap 2, it’s important to have everyone involved in brainstorming potential experiments. Once this brainstorming is complete, do the final stage of planning only with the agents of chaos – to maintain an element of surprise on the Chaos Day itself.
Your planning should consider:
- What failure mode will be involved, for example, a partial connectivity loss, or network slowdown?
- What will be the impact of this in technical AND business terms?
- What is the anticipated response?
- If the team doesn’t resolve this failure, how will it be rolled back?
From here, shortlist 4-8 experiments that offer the most learning potential with acceptable business risk, which will be prepared for your scheduled Chaos Day.
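The planning questions and the shortlisting step can be captured in a small record per candidate experiment. The sketch below is one possible shape: the 1 to 5 scores for learning and risk, the risk ceiling and the example candidates are all assumptions for illustration, and in practice the scores would come from team discussion rather than a formula.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    failure_mode: str
    impact: str                # technical AND business impact
    anticipated_response: str
    rollback: str              # how it is undone if the team doesn't resolve it
    learning: int              # assumed score: 1 (low) to 5 (high)
    risk: int                  # assumed score: 1 (low) to 5 (high)


def shortlist(candidates, max_risk=3, limit=8):
    """Keep experiments within the risk ceiling, highest learning first."""
    acceptable = [c for c in candidates if c.risk <= max_risk]
    return sorted(acceptable, key=lambda c: c.learning, reverse=True)[:limit]


candidates = [
    Experiment("network slowdown", "slow checkout", "alerts fire, team scales out",
               "remove latency rule", learning=4, risk=2),
    Experiment("partial connectivity loss", "some orders fail", "failover to replica",
               "restore routing", learning=5, risk=3),
    Experiment("delete production database", "total outage", "restore from backup",
               "restore from backup", learning=5, risk=5),
]

for exp in shortlist(candidates):
    print(exp.failure_mode)
```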
Executing Chaos Day Experiments
Your participating teams should treat the Chaos Day as a real emergency situation, but that doesn’t mean they shouldn’t be prepared. Make sure people know when the Chaos Day will happen, what communication channel(s) to use, and if there will be a facilitator on hand to track progress through the experiments.
We’ve found Trello boards helpful, with columns to track experiments that are in progress, resolved by the agents of chaos, or resolved by the owning team. Monitor each experiment closely, documenting and analysing the impact in a dedicated private channel, for example in Slack.
Lastly, ensure that experiments are concluded and normal service restored before the end of the day.
How to Review a Chaos Day
As soon as possible after the Chaos Day, ensure you have review meetings with team members involved in the response, agents of chaos and other relevant team members. Each review should walk through the experiment timeline, discussing what people observed and did.
The focus here should be on discovering new insights about the system. If ideas for improvement come up, make a note and assign someone to consider them later. It’s important not to make knee-jerk responses at this stage.
Once the reviews have been completed, identify improvements you could make to the Chaos Day itself. Make a note so that you can return to these improvements next time you plan a Chaos Day.
Sharing the knowledge gained
The last important step in running a successful Chaos Day is to ensure that lessons are learned, and shared with all relevant team members. Create and share write-ups so that stakeholders can benefit from the chaos engineering experience. This might include posting them on Slack or a wiki, or holding a presentation for the wider team.
For more detail on each of these steps and how to apply them to your own Chaos Day, don’t miss our Chaos Day Playbook, which provides detailed guidance on how to plan, execute and review a Chaos Engineering Day in your organisation.
Chaos Days are an opportunity to introduce disruption to your IT systems, so that you can understand how they will respond to possible ‘real’ disruptions. Of course, it’s also a highly effective way for teams to practice and improve how they respond to IT failures.
In this article, we’ll discuss some of the most common questions about Chaos Days and how they can be used to improve IT service resilience. If you’d like to find out more about how to plan, organise and run your own Chaos Day, don’t miss our Chaos Days Playbook, which you can download for free.
Q: What is a Chaos Day?
A Chaos Day is an event that runs over one or more days where teams can safely explore how their service responds to failures. During a Chaos Day, teams design and run controlled experiments in pre-production or production environments. Each experiment injects a failure into the service (such as terminating a compute instance, or filling up a storage device), and the team observes and analyses the impact, the overall system response, and the response of the supporting team. Chaos Days are a practice within the field of chaos engineering.
Q: What is chaos engineering?
Chaos engineering is defined as the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos engineering teams design and run experiments that inject failures into a target system, so the team can learn how the system responds. This learning improves the resilience of the system by:
- Equipping the team with deeper understanding about system behaviour
- Informing the team on where to invest in order to improve system resilience
Q: Why do we organise Chaos Days?
Chaos Days provide a focal event for your team to practice chaos engineering. They are especially useful to teams that might be less familiar with this discipline, because they introduce chaos engineering in a structured, bounded manner.
Chaos Days improve system resilience by helping your people learn about systems, and gain experience in how to diagnose and solve problems in high-stress situations. They provide an opportunity to improve processes such as incident management, incident analysis and engineering approaches, such as how faults should be handled and how resilience testing is performed during feature development.
Finally, Chaos Days help organisations to initiate changes that make services more resilient, improve observability and make services and dependencies better understood.
Q: How do you implement chaos engineering?
Chaos engineering is a broad and deep discipline, to which our Chaos Day playbook provides a great introduction, including a 5-minute guide to running a Chaos Day. Once you’ve digested that, the simplest next steps are to:
- Decide which part of your system you want to learn more about.
- Come up with a hypothesis for how that part responds to specific failures.
- Design and run an experiment to test that hypothesis, by injecting a failure into that part of the system. The failure injection can be manual (e.g. stop a service the system depends on) or automated (e.g. use infrastructure-as-code to remove access to the service for the duration of the experiment).
- Observe how the system responds to the failure and review as a team what was learnt from this experiment and any changes you should make as a result of it.
- Rinse and repeat.
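The steps above can be sketched as a tiny, self-contained experiment. Everything here is a stand-in: `FakeDependency` and `place_order` simulate a system and a service it depends on, and `injected_failure` is a hypothetical helper showing how injection and rollback can be paired so the failure is always removed afterwards.

```python
from contextlib import contextmanager


class FakeDependency:
    """Stand-in for a service the system depends on (purely illustrative)."""

    def __init__(self):
        self.available = True


def place_order(dep: FakeDependency) -> str:
    """Toy system under test: degrades gracefully when the dependency is down."""
    return "order accepted" if dep.available else "order queued for retry"


@contextmanager
def injected_failure(dep: FakeDependency):
    """Stop the dependency for the duration of the experiment, then restore it."""
    dep.available = False
    try:
        yield dep
    finally:
        dep.available = True  # rollback runs even if the experiment raises


dep = FakeDependency()
hypothesis = "order queued for retry"  # what we expect to see under failure

with injected_failure(dep):
    observed = place_order(dep)

print("hypothesis held" if observed == hypothesis else f"surprise: {observed}")
```

Writing the hypothesis down before injecting the failure is what turns breaking things into an experiment: a mismatch between `hypothesis` and `observed` is the learning to take into the review.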
Q: What are some top tips for Chaos Days?
- Start small, with one or two teams and a few experiments, not tens of teams and tens of experiments. This allows you to adapt and learn how to run a Chaos Day in your specific context, before scaling out to multiple teams and many experiments.
- Plan ahead – it’s possible to run a mini chaos event in a single day, but you’ll get the most from any chaos event by scheduling time in advance to design and run experiments, then reflect and share the lessons extracted from them.
- Spread knowledge by involving the whole team, but limiting how much diagnosis and repair your most experienced engineers do – either treat them as absent for that day or pair them with less experienced team members.
- Be conscious of business critical events that the chaos might impact (especially if it gets out of control). Also, allow time to return the system to its normal state. You don’t want to take down a key environment just when it’s needed for a critical release.
Q: What tools are available for running a Chaos Day? How should we run a Chaos Day if we’re running AWS?
The experiments you run during a Chaos Day typically modify system configuration or behaviour in some way that simulates a failure (e.g. shutting down a compute instance, closing a network connection). These modifications can either be done manually (e.g. through temporarily editing configuration) or in a more automated manner via tooling such as infrastructure-as-code (IaC), Chaos Monkey, AWS Fault Injection Simulator or Gremlin. If you want to repeat experiments or track them via source-control, then the tooling approach is preferable, as it codifies the experiment and automates its injection and rollback.
Q: How to set up a chaos engineering day?
That’s simple – just follow our playbook!
The complexity of modern IT systems makes it impossible to predict how they will respond to every potential failure or interruption.
Chaos engineering helps organisations explore possible adverse events by designing and running controlled experiments that involve introducing a failure, then analysing the impact and response to that failure.
Chaos engineering experiments present any organisation with certain risks – introducing an error into any IT system could have unintended consequences. This means it’s essential to plan such experiments carefully. If the experiments are to be part of a focussed event, such as a Chaos Day, make sure that:
- You have appropriate time to plan the experiments, and people required to plan and execute the experiments are available at the scheduled time.
- Your proposed date does not clash with major business events or planned changes to your existing IT systems.
- The existing systems are stable, and will be able to recover from introduced failures or errors.
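The date check in the list above is easy to automate if your organisation keeps a calendar of business events. The sketch below flags events that fall too close to a proposed Chaos Day. The event names, dates and the 14-day buffer are invented for illustration, and `clashes` is a hypothetical helper.

```python
from datetime import date, timedelta

# Hypothetical business calendar; names and dates are invented for illustration.
BUSINESS_EVENTS = [
    ("seasonal sale launch", date(2025, 11, 28)),
    ("platform upgrade", date(2025, 9, 15)),
]


def clashes(proposed: date, events, buffer_days: int = 14):
    """Return the names of events within `buffer_days` of the proposed date."""
    buffer = timedelta(days=buffer_days)
    return [name for name, when in events if abs(when - proposed) <= buffer]


print(clashes(date(2025, 9, 22), BUSINESS_EVENTS))
print(clashes(date(2025, 10, 20), BUSINESS_EVENTS))
```

An empty list means no known event falls within the buffer. It does not replace checking with the teams that depend on the target environment.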
These six questions are designed to help you understand when is the best time to hold a Chaos Day in your organisation:
Does the business consider a Chaos Day to be important at the current time?
Chaos Days generally provide a good return on investment for IT organisations, but the benefits might not be as obvious in the short-term as the benefit of investing in new capabilities. This can cause organisations to ‘put off’ Chaos Days, or not hold them at all.
If this is the case, then we recommend starting with a smaller investment such as a time-boxed risk assessment, using a FAIR (Factor Analysis of Information Risk) approach. This type of exercise provides an opportunity to explore what failures could happen, their frequency and the potential impact they would have. The results can give meaningful data that can be used to build a business case for Chaos Days that might persuade stakeholders to prioritise resilience over or alongside new capabilities.
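At its simplest, this kind of assessment multiplies how often a failure is expected by what each occurrence costs. The sketch below is a deliberate simplification: a full FAIR analysis works with ranges and distributions rather than single numbers, and the scenarios and figures here are invented for illustration.

```python
# Hypothetical scenarios: (name, expected events per year, cost per event).
scenarios = [
    ("address lookup timeout", 12.0, 1_000),
    ("payment provider outage", 2.0, 50_000),
    ("database failover fails", 0.5, 120_000),
]


def annual_loss(frequency_per_year: float, magnitude: float) -> float:
    """Expected yearly loss: how often it happens times what each event costs."""
    return frequency_per_year * magnitude


# Rank scenarios so the largest expected losses are discussed first.
ranked = sorted(scenarios, key=lambda s: annual_loss(s[1], s[2]), reverse=True)
for name, freq, cost in ranked:
    print(f"{name}: {annual_loss(freq, cost):,.0f} per year")
```

Even rough numbers like these can help stakeholders compare the cost of a Chaos Day with the cost of the failures it is meant to prevent.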
Do we have enough time to plan, execute and review a Chaos Day?
It’s vital that Chaos Days are carefully scheduled to ensure that the organisation has the time and resources available to fully engage with the experiment. The timeline for a Chaos Day must allow sufficient time for planning. We recommend that you arrange a planning session with the whole team at least two weeks before the Chaos Day.
Ideally, the Chaos Day should be scheduled for a date that won’t impact key business events. For example, if the target environment is severely degraded, check this won’t delay any production releases that need to pass through it around that date. You will then need to allow time to schedule review meetings after the Chaos Day for teams, to capture learning and make recommendations.
You’ll need to schedule time to discuss and plan experiments, and decide when they should be run. You’ll need a couple of weeks after planning to allow engineering teams enough time to design, implement and test the experiments. Don’t forget to check if other teams have any key dependencies on the target environment and services over the Chaos Day, to avoid causing problems with experiments.
Do we have any major changes planned?
In general, the best time to schedule a Chaos Day is far enough away from business changes to avoid disruption, but close enough to planned changes to provide useful insight, while leaving time to resolve any problems that could impact those changes. For example, an online retailer could hold a Chaos Day two months before a seasonal peak in traffic, while a manufacturer could organise a Chaos Day to test its supply chain management before implementing a new shipping process.
It’s important to allow time between the Chaos Day and the business change, so that learning can be distilled and improvements applied. At one of our clients with a very large platform (1,000+ microservices processing 1 billion requests on peak days), we found that 2-3 Chaos Days per year was ideal.
Is our system currently stable?
Chaos Days offer important benefits, but if your systems are currently unstable and there are regular incidents and failures, you might already have enough chaos to contend with! In this instance, we advise focusing on improving post-incident reviews to bring about system stability. Once you’ve had a few months free of issues, then you could try a small Chaos Day in pre-production to test that stability.
Are the right people available to support our Chaos Day?
To get the best results from a Chaos Day you’ll need experienced engineers across relevant teams, to design and execute experiments. Ensure that these people are not currently committed to other tasks at the time you will need them for your Chaos Day.
How much time do we need for a Chaos Day?
To identify the best time to run a Chaos Day, first decide whether you need just a few hours, a whole day, or whether the event should be spread over several days.
Running a Chaos Day over a single day results in a more intensive but shorter event, which can add stress. Our experience is that this intensity also improves team dynamics and leads to a more memorable event. The disadvantage of a single Chaos Day is that it can be hard to maintain an element of surprise, especially in a pre-production environment. Teams need to be informed to treat failures in this environment as though production were on fire, so it’s likely they’ll be paying close attention.
Spreading a Chaos Day over a number of days makes it easier to spring surprises on the team, because people won’t know exactly when experiments will be run during the chosen period. It also allows for adjustments to be made to experiments, using learning from early experiments to improve later ones.
For more insight into the benefits of running a Chaos Day, along with expert guidance on how and when to organise Chaos Days for the maximum benefit, check out the playbook online here.
For technology professionals, stability is the gold standard. But once in a while, it is important to create chaos.
A Chaos Day is an opportunity to carry out carefully planned experiments that introduce errors and turbulent conditions in IT systems, such as terminating a compute instance or filling up a storage device. It’s a useful exercise for any organisation that wants to understand the impact and system response, and then use that understanding to improve reliability and resilience.
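To make the storage example concrete, here is a minimal sketch of an experiment helper that writes a filler file into a directory and removes it again when the experiment ends. The `filled_storage` name, the file name and the sizes are assumptions for illustration. It should only ever be pointed at a disposable or pre-production volume, and filesystems that compress or deduplicate data may not actually consume the space written.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def filled_storage(directory, megabytes, chunk_mb=1):
    """Write a filler file of roughly `megabytes` MB, then delete it on exit."""
    path = os.path.join(directory, "chaos-filler.bin")  # hypothetical file name
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    try:
        with open(path, "wb") as handle:
            for _ in range(megabytes // chunk_mb):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes reach the device
        yield path
    finally:
        # Always clean up, even if the experiment body raised an error.
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # Harmless demonstration: 1 MB in a throwaway temporary directory.
    with tempfile.TemporaryDirectory() as scratch:
        with filled_storage(scratch, megabytes=1) as filler:
            print("filler size in bytes:", os.path.getsize(filler))
```

Wrapping the experiment in a context manager means the cleanup runs even if something goes wrong part-way through, which keeps the blast radius of the experiment under control.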
Wondering whether you could benefit from a Chaos Day? Here are six reasons why your organisation should run a Chaos Day this year:
1: To prepare for the unexpected
The main benefit of a Chaos Day is that it helps the organisation prepare for inevitable failures and unexpected events, before they actually do occur. In today’s complex IT environments, turbulence will happen due to single point failures or multiple, unrelated failures – often combined with sudden changes in external pressure, such as traffic spikes or security threats.
2: To analyse how your team prepares for and responds to problems
Carrying out a Chaos Day allows you to view, analyse and improve how your team responds to unexpected, turbulent conditions. It can provide a safe way to identify gaps in your team’s skills around collaborating, communicating and thinking during high-stress periods.
3: To improve skills and knowledge across your IT team
During a Chaos Day, you can expect your team to gain:
- New knowledge about system behaviour
- Expertise in diagnosing and resolving incidents
- Better skills around collaboration and communication
- Greater understanding of system failures and recovery
Teams will also share knowledge while working through problems on a Chaos Day, meaning each team has a better understanding of their colleagues’ knowledge and skills. Chaos Days can also improve technical knowledge, which can be used to make changes that boost resilience. For example, a Chaos Day can illustrate the usefulness of new features such as retry mechanisms and circuit breakers.
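For readers unfamiliar with these patterns, the following is a minimal sketch of a circuit breaker combined with a simple retry loop. It is illustrative only and not taken from any particular library: the class, function and parameter names are assumptions, and a production system would more likely use an established resilience library with backoff and jitter.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: allow a trial call, but re-open on one failure.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success resets the failure count
        return result


def call_with_retry(breaker, func, attempts=3, delay_seconds=0.0):
    """Retry `func` through the breaker; stop at once if the circuit opens."""
    last_error = None
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # retrying against an open circuit would only add load
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error
```

A Chaos Day experiment that makes a dependency fail repeatedly is a good way to observe whether the retry gives up quickly and whether the breaker protects the struggling service from further calls.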
4: To build resilience
The ultimate goal of a Chaos Day is to build resilience. Teams gain a greater understanding of system behaviour and failure scenarios, which they can draw on when tackling production incidents or developing system enhancements.
Chaos Days improve system resilience by improving:
- The skills, knowledge and understanding of your team
- Processes, by guiding improvements in incident management, analysis and engineering
- Products, by initiating changes that make services more resilient, and by improving documentation such as error messages and runbooks