Chaos Days are an opportunity to introduce disruption to your IT systems, so that you can understand how they will respond to possible ‘real’ disruptions. They are also a highly effective way for teams to practise and improve how they respond to IT failures.
In this article, we’ll discuss some of the most common questions about Chaos Days and how they can be used to improve IT service resilience. If you’d like to find out more about how to plan, organise and run your own Chaos Day, don’t miss our Chaos Days Playbook, which you can download for free.
Q: What is a Chaos Day?
A Chaos Day is an event that runs over one or more days, where teams can safely explore how their service responds to failures. During a Chaos Day, teams design and run controlled experiments in pre-production or production environments. Each experiment injects a failure into the service (such as terminating a compute instance or filling up a storage device), and the team observes and analyses the impact, the overall system response and the response of the supporting team. Chaos Days are a practice within the field of chaos engineering.
Q: What is chaos engineering?
Chaos engineering is defined as the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos engineering teams design and run experiments that inject failures into a target system, so the team can learn how the system responds. This learning improves the resilience of the system by:
- Equipping the team with deeper understanding about system behaviour
- Informing the team on where to invest in order to improve system resilience
Q: Why do we organise Chaos Days?
Chaos Days provide a focal event for your team to practise chaos engineering. They are especially useful to teams that might be less familiar with this discipline, because they introduce chaos engineering in a structured, bounded manner.
Chaos Days improve system resilience by helping your people learn about systems and gain experience in diagnosing and solving problems in high-stress situations. They provide an opportunity to improve processes such as incident management and incident analysis, as well as engineering approaches such as how faults are handled and how resilience testing is performed during feature development.
Finally, Chaos Days help organisations to initiate changes that make services more resilient, improve observability and make services and dependencies better understood.
Q: How do you implement chaos engineering?
Chaos engineering is a broad and deep discipline, to which our Chaos Day playbook provides a great introduction, including a 5-minute guide to running a Chaos Day. Once you’ve digested that, the simplest next steps are to:
- Decide which part of your system you want to learn more about.
- Come up with a hypothesis for how that part responds to specific failures.
- Design and run an experiment to test that hypothesis, by injecting a failure into that part of the system. The failure injection can be manual (e.g. stop a service the system depends on) or automated (e.g. use infrastructure-as-code to remove access to the service for the duration of the experiment).
- Observe how the system responds to the failure and review as a team what was learnt from this experiment and any changes you should make as a result of it.
- Rinse and repeat.
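The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `Experiment` class, the `run_experiment` function and the toy "user service" with a cache fallback are all hypothetical names invented for this example. The point it shows is that an experiment pairs a hypothesis with an injection and a rollback, and that the rollback runs even if the observation step fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Experiment:
    """One chaos experiment; every field here is an illustrative assumption."""
    target: str                    # part of the system you want to learn about
    hypothesis: str                # how you expect it to respond
    inject: Callable[[], None]     # introduces the failure
    rollback: Callable[[], None]   # returns the system to its normal state
    check: Callable[[], bool]      # True if the hypothesis held


def run_experiment(exp: Experiment) -> dict:
    """Inject the failure, observe the response, and always roll back."""
    exp.inject()
    try:
        held = exp.check()
    finally:
        exp.rollback()  # restore the system even if the check raises
    return {"target": exp.target, "hypothesis": exp.hypothesis, "held": held}


# Toy system: a service that falls back to a cache when its database is down.
state = {"db_up": True, "cache": {"user:1": "Ada"}}


def fetch_user() -> Optional[str]:
    if state["db_up"]:
        return "Ada"
    return state["cache"].get("user:1")


experiment = Experiment(
    target="user service",
    hypothesis="reads are served from the cache while the database is down",
    inject=lambda: state.update(db_up=False),
    rollback=lambda: state.update(db_up=True),
    check=lambda: fetch_user() == "Ada",
)

result = run_experiment(experiment)
```

In a real Chaos Day the `inject` and `rollback` steps would act on your own infrastructure, and the result would feed the team review rather than a dictionary.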
Q: What are some top tips for Chaos Days?
- Start small, with one or two teams and a few experiments, not tens of teams and tens of experiments. This allows you to adapt and learn how to run a Chaos Day in your specific context, before scaling out to multiple teams and many experiments.
- Plan ahead – it’s possible to run a mini chaos event in a single day, but you’ll get the most from any chaos event by scheduling time in advance to design and run experiments, then reflect and share the lessons extracted from them.
- Spread knowledge by involving the whole team, while limiting how much diagnosis and repair your most experienced engineers do – either treat them as absent for that day or pair them with less experienced team members.
- Be conscious of business critical events that the chaos might impact (especially if it gets out of control). Also, allow time to return the system to its normal state. You don’t want to take down a key environment just when it’s needed for a critical release.
Q: What tools are available for running a Chaos Day? How should we run a Chaos Day if we’re running AWS?
The experiments you run during a Chaos Day typically modify system configuration or behaviour in some way that simulates a failure (e.g. shutting down a compute instance, closing a network connection). These modifications can either be done manually (e.g. through temporarily editing configuration) or in a more automated manner via tooling such as infrastructure-as-code (IaC), Chaos Monkey, AWS Fault Injection Simulator or Gremlin. If you want to repeat experiments or track them via source-control, then the tooling approach is preferable, as it codifies the experiment and automates its injection and rollback.
Q: How do you set up a chaos engineering day?
That’s simple – just follow our playbook!
In our recent Operationalising ML Playbook we discussed the most common pitfalls in MLOps. One of them is failing to implement appropriate secure development practices at each stage of MLOps.
Our Secure Development playbook describes the practices we know are important for secure development and operations. These should also be applied to your ML development and operations.
In this blog we will explore some of the security risks and issues that are specific to MLOps. Make sure you check them all before publishing your model into production.
In machine learning, systems use example data to try to learn something – which may be output as a prediction or insight. The examples used to train ML models are known as training datasets, and security issues can be broadly divided into those affecting the model before and during training, and those affecting models that have already been trained.
Vulnerability to data poisoning or manipulation
One of the most commonly discussed security issues in MLOps is data poisoning. This is an attack in which attackers attempt to corrupt or manipulate the data used to train ML models, for example by switching expected responses or adding new responses into a system. The result of data poisoning is that data confidentiality and reliability are both damaged.
When data for ML models is collected from sensors or online sources, the risk of data poisoning can be extremely high. Attacks can include label flipping (data is poisoned by changing labels in data) and gradient descent attacks (where the ability of a model to understand how close it is to predicting the correct answer is damaged by either making the model falsely believe it’s found the answer, or by preventing it from finding the answer by constantly changing that answer).
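To make label flipping concrete, here is a small sketch on toy binary labels. The helper names `flip_labels` and `changed_fraction` are hypothetical, and the check assumes you keep a trusted snapshot of your labels to compare against, which may not hold for every pipeline.

```python
import random


def flip_labels(labels, fraction, seed=0):
    """Return a copy of binary labels with `fraction` of them flipped."""
    rng = random.Random(seed)
    n_flip = int(len(labels) * fraction)
    indices = rng.sample(range(len(labels)), n_flip)  # distinct positions
    poisoned = list(labels)
    for i in indices:
        poisoned[i] = 1 - poisoned[i]  # 0 becomes 1, 1 becomes 0
    return poisoned


def changed_fraction(trusted, candidate):
    """Share of labels that differ from a trusted snapshot."""
    diffs = sum(1 for a, b in zip(trusted, candidate) if a != b)
    return diffs / len(trusted)


clean = [0, 1] * 50                 # 100 toy binary labels
poisoned = flip_labels(clean, 0.1)  # an attacker flips 10% of them
drift = changed_fraction(clean, poisoned)
```

Comparing incoming labels against a trusted snapshot is only one possible check. It detects flipped labels on known examples, but not newly inserted malicious examples.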
Exposure of data in the pipeline
You will certainly need to include data pipelines as part of your solution. In some cases they may use personal data in the training. This data should be protected to the same standards as in any other development. Ensuring the privacy and confidentiality of data in machine learning models is critical to protect against data extraction attacks and function extraction attacks.
Making the model accessible to the whole internet
Making your model endpoint publicly accessible may expose unintended inferences or prediction metadata that you would rather keep private. Even if your predictions are safe for public exposure, making your endpoint anonymously accessible may present cost management issues. A machine learning model endpoint can be secured using the same mechanisms as any other online service.
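As a rough sketch of what "the same mechanisms as any other online service" can mean, the example below rejects callers without a valid API key and caps the number of requests per key in a time window, which addresses both the exposure and the cost concerns. `EndpointGuard` is a hypothetical helper written for this illustration. In practice you would more likely rely on your API gateway or web framework for this.

```python
import hmac
import time


class EndpointGuard:
    """Reject anonymous callers and cap requests per API key (hypothetical helper)."""

    def __init__(self, valid_keys, max_requests, window_seconds):
        self.valid_keys = [k.encode() for k in valid_keys]
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = {}  # api key -> timestamps of recent allowed requests

    def allow(self, api_key, now=None):
        """Return True if the request may reach the model."""
        if not isinstance(api_key, str):
            return False  # no key supplied: anonymous access is refused
        now = time.monotonic() if now is None else now
        supplied = api_key.encode()
        # Constant-time comparison avoids leaking key contents through timing.
        if not any(hmac.compare_digest(supplied, k) for k in self.valid_keys):
            return False
        recent = [t for t in self._history.get(api_key, [])
                  if now - t < self.window_seconds]
        if len(recent) >= self.max_requests:
            self._history[api_key] = recent
            return False  # over the per-key budget for this window
        recent.append(now)
        self._history[api_key] = recent
        return True


guard = EndpointGuard(["secret-key"], max_requests=2, window_seconds=60)
```

The `now` parameter exists only so that the behaviour can be exercised with fixed timestamps. Real callers would omit it.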
Embedding API Keys in mobile apps
A mobile application may need specific credentials to directly access your model endpoint. Embedding these credentials in your app allows them to be extracted by third parties and used for other purposes. Securing your model endpoint behind your app backend can prevent uncontrolled access.
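The backend pattern can be sketched as follows. The names here are assumptions made for illustration: `make_backend_handler` builds a request handler that runs on your app backend, and `call_model` stands in for whatever client sends the request to your model endpoint. The credential is held only on the server, so it never ships inside the mobile app.

```python
def make_backend_handler(call_model, credential):
    """Build a handler that runs on the app backend.

    `call_model` is a hypothetical function that sends features to the model
    endpoint; `credential` stays on the server and is never sent to the app.
    """
    def handle_app_request(user_id, features):
        if not user_id:
            # The app authenticates its user to the backend, not to the model.
            return {"status": 401, "error": "sign-in required"}
        prediction = call_model(features, credential)
        # Return only what the app needs; never echo the credential.
        return {"status": 200, "prediction": prediction}
    return handle_app_request


def fake_call_model(features, credential):
    """Stand-in for a real call to the model endpoint."""
    if not credential:
        raise ValueError("backend must hold the credential")
    return sum(features)


handler = make_backend_handler(fake_call_model, credential="server-side-secret")
response = handler("user-42", [1, 2, 3])
```

Because the app only ever talks to the backend, access can be revoked or limited per user without releasing a new version of the app.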
As with most things in development, it only takes one person to neglect MLOps security to compromise the entire project. We advise organisations to create a clear and consistent set of governance rules that protect data confidentiality and reliability at every stage of an ML pipeline.
Everyone in the team needs to agree on the right way to do things – it only takes one leak or data attack for the overall performance of a model to be compromised.