We – Steve Smith and Ali Asad Lotia – are the Heads of Operability at Equal Experts (EE). We’d like to set out EE’s position on Site Reliability Engineering (SRE).
We’ll recommend the bits you should try in your organisation, mention some bits you (probably) shouldn’t try, and explain how SRE is linked to operability.
If you’re in a rush, the EE position on SRE is:
- Try availability targets, request success rate measurements, Four Golden Signals, SLIs, and SLOs.
- Maybe try an SRE advocacy team.
- Don’t try error budgets or an SRE on-call team.
And regardless of SRE, do try putting your delivery teams on call. This is better known as You Build It You Run It.
In 2004, Ben Treynors Sloss started an initiative within Google to improve the reliability of their distributed services. He advocated for reliability as a software feature, with developers automating tasks traditionally owned by operations teams. The initiative was called SRE, and it’s become widely known in recent years.
In Site Reliability Engineering by Betsey Byers et al, the authors set the scene for SRE by answering “why can’t I have 100% reliability?:”
- 100% can’t happen, because your user experience is always limited by your device (your wifi or 4G connection isn’t 100% reliable).
- 100% shouldn’t be attempted, because maximising availability limits your speed of feature delivery, and increases operational costs.
In The Site Reliability Workbook by Betsey Byers et al, Andrew Clay Shafer talks about reliability at scale, and says, ‘I know DevOps when I see it and I see SRE at Google, in theory and practice, as one of the most advanced implementations’.
Back in 2017, our CEO Thomas Granier explained why DevOps is just a conversation starter at EE. We both believe SRE is a conversation starter as well. It’s an overloaded concept. Phrases such as “SRE practice” and “SRE team” can be really confusing. Within EE, those terms have been clarified to reduce confusion.
The bits of SRE you should try
Based on our experiences, both of us recommend you try these SRE practices:
- Availability targets. Calculate an availability level on downtime cost, downtime tolerance, and engineering time, to set clear expectations of availability.
- Four Golden Signals. Focus dashboards on throughput, error rate, latency, and saturation, so operating conditions are easier to understand.
- Service Level Indicators (SLIs). Visualise targets for availability, latency, etc. on dashboards, so operational tolerances can be watched.
- Service Level Objectives (SLOs). Implement targets for availability, latency etc. as production alerts, so abnormal conditions are easily identified.
Don’t try them all at once! Run some small experiments, collect some feedback, and then adjust your approach. Availability targets are a good starting point, and Site Reliability Engineering lays out an excellent approach.
An availability target is chosen by a product manager, from a set of availability levels. First, a product manager estimates their downtime cost, based on the revenue and reputational damage of downtime. That cost is then matched to a balance between maximum tolerable downtime and required engineering time.
Engineering time stems from a valuable insight from Betsey Byers et al:
‘Each additional nine corresponds to an order of magnitude improvement toward 100% availability’
This is a powerful heuristic you can use to reason about availability targets. Like all heuristics, it’s enough for a short-term approximation that won’t be perfect. Engineering effort will always vary by service complexity.
For example, a delivery team owns a service with synchronous dependency calls. They spend three days on operational features, to harden the service until it reaches 99.0% availability. For the exact same team and exact same service, it would take up to 30 days to reach 99.9%, maybe by adding blue-green deployments and caching dependency calls. It would take 300 days for 99.99%, perhaps by reengineering dependency calls to be asynchronous and replacing the service runtime. The product manager would have to balance availability needs against three days, one month, or nine months of effort.
The bits of SRE you shouldn’t try
EE consultants strive to advise organisations on what not to try, as well as what to try. We both believe you should (probably) skip these SRE practices:
- Error budgets. Turning tolerable downtime into a budget for deployments, and halting deployments for remediation work if too many errors occur.
- SRE on-call team. Using a central delivery team of SRE developers to support services with critical traffic levels via error budgets, while other services have delivery teams on call.
These aren’t bad ideas. They’re expensive ideas. They require cultural and technology changes that take at least an order of magnitude longer than other SRE practices. We’d only consider an SRE on-call team over You Build It You Run It if an organisation had services with an ultra high downtime cost, relative to its other services. Then a 99.99% availability target and up to 100x more engineering time might be justifiable.
We’ve used the above availability table, in private and public sector organisations. We’ve asked product managers to choose availability levels based on downtime costs, their personal tolerances for downtime, and engineering time. We’ve not seen a product manager choose more than 99.9% availability and 10x engineering time. None of them anticipated a downtime cost that warranted 99.99% availability and up to 100x more engineering time.
EE doesn’t recommend an SRE on-call team, because it’s simpler and more cost effective to put delivery teams on call.
There’s a common misconception you can rebadge an existing operations team as an SRE on-call team, or an SRE advocacy team. Both of us have repeatedly advised organisations against this. Aside from the expensive cultural and technology challenges linked to both types of SRE team, adopting SRE principles requires software engineering skills in infrastructure and service management. That expertise is usually absent in operations teams.
For a given service, we both believe these are valid production support options:
It’s all about operability
In 2017, our colleague Dan Mitchell talked about operability as the value-add inside DevOps. Dan described operability as ‘the operational requirements we deliver to ensure our software runs in production as desired”. He mentioned techniques such as automated infrastructure, telemetry, deployment health, on-call delivery teams, and post-incident reviews.
Operability is a key enabler of Continuous Delivery. Continuous Delivery is about improving your time to market. A super-fast deployment pipeline won’t be much help if your production services can’t be operated safely and reliably. EE helps organisations to build operability into services, to increase their availability and their ability to cope with failures.
Operability is the value-add inside SRE.
The SRE practice of availability targets is an effective way for an organisation to genuinely understand its availability needs and downtime tolerances. Common definitions of availability and downtime need to be established, and a recognition that planned downtime is outside of downtime tolerances. This may impact the architecture of different services, as well as patching and upgrade processes.
Four Golden Signals, SLIs, and SLOs are a great way to improve your ability to cope with failures. Per-service dashboards tied to well-understood characteristics, and per-service alerts tied to availability targets can provide actionable, timely data on abnormal operating conditions.
For example, Steve recently worked with an enterprise organisation to introduce availability targets, SLO alerts, and You Build It You Run It to their 30 delivery teams and £2B revenue website. In the first year, this was 14x cheaper on support costs, 3x faster on incident response time, and 4x more effective on revenue protection. SRE was hardly mentioned.
If your organisation has a few delivery teams, we’d expect them to adopt operability practices for themselves. If you have delivery teams at scale, you might consider an SRE advocacy team, as Jennifer Strejevitch describes in how to be effective as a small SRE practice. We’ve done something similar with Digital Platform Enablement teams, as described in our Digital Platform playbook.
SRE is a real pick and mix. We believe some of its practices are really good. You should try them, to progress towards Continuous Delivery and operability. We also see some ideas that you (probably) shouldn’t try.
The EE position on SRE is:
- Do try availability targets, request success rate measurements, Four Golden Signals, SLIs, and SLOs (and don’t call it SRE if you don’t want to).
- Maybe try an SRE advocacy team (if you have delivery teams at scale).
- Don’t try error budgets or an SRE on-call team (unless you genuinely need 99.99% availability).
And with or without SRE terminology:
- Do try putting your delivery teams on call, to increase service deployments and improve production reliability.
If you’d like some advice on SRE, Continuous Delivery, or operability, get in touch using the form below and we’ll be delighted to help you.
We are often asked by our clients when is a good time to start building a digital platform. To help answer this question we’ve established minimum criteria that need to be met before funding is allocated and development work begins.
We recommend you revisit these criteria once a quarter in your first year, and once a year after that. This will help you to understand the target architecture of your Digital Platform, and continuously validate the vision for your Digital Platform.
- Multi-year funding
- Homogeneous workload
- At least one Digital Service team at outset
- Empowered teams
- Potential for five Digital Service teams
You need to be able to make allowances for multi-year funding
A Digital Platform is a significant investment. It’s a strategic asset rather than a cost-cutting liability. It’s funded as a product.
Multi-year funding is a positive signal of a commitment to continuous improvement. Without that commitment, your Digital Platform teams will not be able to redesign platform capabilities to satisfy changing user needs, or leverage new commodity cloud services to reduce costs.
You need a homogeneous workload
A Digital Platform is based on a homogeneous workload, created by multiple Digital Services. If different Digital Services have heterogeneous workloads, your Digital Platform teams will be slower to deliver new features. They will have to seek consensus between different Digital Service teams on which platform capabilities need to be enhanced. The user experience for Digital Service teams will be diminished.
For example, a Digital Platform could support Kotlin microservices and React frontends. A team might ask for data pipelines to be supported as an additional workload type, for a one-off Digital Service. That request would be politely declined by the Digital Platform teams, and there would be a collaborative effort to find an alternative solution outside the Digital Platform.
You need at least one Digital Service team from the outset
A Digital Platform starts with a minimum of one Digital Platform team and one Digital Service team. This means the first bi-directional feedback loop can be established between teams, and the initial platform features can be quickly validated.
Your first Digital Service team needs to have completed its inception phase. This ensures the Digital Service workload is sufficiently well understood to begin construction of the Digital Platform. Otherwise, the delivery of new platform features will be slowed down, due to the rework needed to focus on a different workload type.
A Digital Platform team that starts out without a Digital Service team will fall into the Premature Digital Platform Team pitfall.
You need empowered teams
A Digital Platform exists in an ecosystem in which Digital Platform teams are free to make their own technology choices. They need to work independently of any pre-approved tools, so they can experiment with new technologies that meet the particular needs of the Digital Service teams.
In a similar vein, Digital Service teams have freedom within the Digital Platform ecosystem. The Digital Platform teams build platform capabilities with sensible defaults, and Digital Service teams can configure them as necessary.
There needs to be some pragmatism. Digital Platform and Digital Service teams need to include pre-existing tools when exploring problems. However, the people best suited to make decisions are those closest to the work, and they must not be beholden to an old list of ill-suited technologies.
There should be potential for five Digital Service teams
A Digital Platform has multi-year funding linked to a recognition that at least five Digital Service teams are likely to exist in the future. In other words, there needs to be sufficient product demand for at least five distinct Digital Services within your organisation. From our experience of building Digital Platforms with multiple organisations, we believe this is the tipping point at which strategically investing in a Digital Platform is beneficial.
If there is zero potential for five or more Digital Service teams, we don’t believe a Digital Platform is the right approach. You won’t achieve the economies of scale to validate the multi-year funding. A better approach would be to invest funding and resources directly into your handful of teams, ensuring they can build and operate their services.
We hope you find this useful. For more information about Digital Platforms take a look at our Digital Platform Playbook. We thrive on feedback and welcome contributions. As you can see, we love building digital platforms! If you’d like us to share our experience with you, get in touch in the form below.
A Digital Platform optimised for the delivery of Digital Services can be an accelerator for your organisation.
The Equal Experts Digital Platform playbook is our thinking on why, when, and how to build Digital Platforms. We have found that, under the right circumstances, introducing a Digital Platform enables an organisation to achieve Continuous Delivery and Operability at scale.
Our approach is based on first-hand experience building Digital Platforms in a wide range of domains such as Government, Financial Services, Retail and Utilities and our deep expertise in helping organisations adopt Continuous Delivery and Operability principles and practices.
To be competitive, your organisation must rapidly explore new product offerings as well as exploit established products. New ideas must be validated and refined with customers as quickly as possible if product/market fit and repeatable success are to be found.
You might have multiple teams in a brownfield or greenfield IT estate, where your ability to deliver product features is constrained by your technology capabilities. In either scenario, a Digital Platform optimised for the delivery of Digital Services can be an accelerator for your organisation – if you can make a multi-year commitment to investment. A Digital Platform isn’t a small undertaking and requires ongoing funding for you to realise the greatest benefits.
Who is this playbook for
We’ve created this playbook to help you and your colleagues build a Digital Platform together. It’s for everyone in your organisation, not just software developers or operability engineers. That includes CIOs, CTOs, product managers, analysts, delivery leads, engineering managers, and more.
We’re strong proponents of cloud-native computing, serverless in all its forms, microservice architectures, and open-source technologies. However, the practices defined in our playbook are technology and vendor-agnostic, to allow you to determine the best way to adopt these ideas in the context of your organisation.
What it is about
It is worth noting that the playbook is a game plan in the sense that it is not a recipe for a single activity but an orchestration of a number of ideas that together make up a successful Digital Platform. The playbook touches on topics such as What is a Digital Platform, its capabilities, benefits and when to start building one. We have recommended principles to adopt and we outline the practices and pitfalls we’ve identified along our way.
We hope you find this and our other playbooks useful. We thrive on feedback and welcome contributions. As you can see, we love building digital platforms! If you’d like us to share our experience with you, get in touch in the form below.