Steve Smith, Head of Scale Service

Our Thinking Tue 30th August, 2022

Why product teams still need major incident management

You’ve probably heard of You Build It You Run It before. It’s an operating model that empowers product teams to own every aspect of digital service management. When done well, it accelerates your time to market, increases your service reliability, and grows a learning culture.

There are also some pitfalls, which can drain the confidence of your senior leadership, and ultimately put the success of You Build It You Run It at risk.

In our You Build It You Run It playbook, my co-author Bethan Timmins and I look at a common incident management pitfall, and the risk of major incidents lacking an effective response. The good news is that you can guard against this pitfall, and if you find yourself already stuck in it, don’t panic – there is a way out!

What is incident management anyway?

If you work in a large IT department with separate Build and Run functions, you’ll have one or more operations teams managing your SaaS, COTS, and custom back office applications. In our playbook, we call this Ops Run It for your foundational systems.

You’ll probably have an L1 ops bridge team, an L2 application support team, and an incident management team. If there’s a RACI model to show who’s responsible, accountable, consulted, and informed on incident response, it’ll look like this:

Incident commander and incident manager are different roles. An incident commander is a leader of an incident response team. An incident manager is a facilitator for all incident response teams. 

During an incident, the incident manager captures incident details in your system of record (such as ServiceNow), coordinates software incident response across multiple teams when necessary, and periodically updates senior leadership and users on impact and resolution progress. They usually nominate the application support team manager as the incident commander.

It’s important to have a repeatable, reliable incident management process that’s easily followed. If you’re thinking about implementing You Build It You Run It for your digital services, it’s essential that you don’t lose sight of that repeatable, reliable process.

The no major incident management pitfall

With You Build It You Run It, your on-call product teams are accountable for incident response for their own digital services. If no effort is made to align them with your major incident management process, they can easily miss out on incident management altogether. It can be expressed as this RACI:

Symptoms of ad hoc incident response include:

  • Incident details aren’t recorded in your system of record, and can only be found in team documents (if at all)
  • Incident commander isn’t recognised as a role, and responders can make individual decisions in different directions that harm overall response efforts
  • Incident response isn’t coordinated effectively with other teams and vendors
  • Incident response varies from incident to incident, and doesn’t comply with internal regulations 
  • Incident communications are non-existent, and senior leadership doesn’t know if response efforts are progressing.

This results in inconsistent, unreliable, and time-consuming incident response efforts. The financial losses and reputational damage incurred per incident will be higher than necessary, and as a result your senior leadership will lack confidence in You Build It You Run It as an operating model.

This pitfall is a consequence of high-autonomy, low-alignment teams. To avoid or escape it, you need to make sure your product teams are aligned with the same major incident management process you use for your foundational systems.

You Build It You Run It with incident management

At Equal Experts, we see You Build It You Run It as a hybrid operating model. It means product teams and operations teams rely on the same operational enablers, including your incident management team.

Your product teams and incident management team need to be connected at the outset of each new digital service. Product team engineers can learn the benefits of major incident management, and identify opportunities to improve the process. One example is product teams automating incident data capture, by implementing bi-directional sync between your incident response platform (such as PagerDuty) and your system of record (such as ServiceNow). That can be rolled out to all your operations and product teams, to reduce toil during incident response efforts.
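To make the idea of bi-directional sync a little more concrete, here is a minimal sketch in Python. It does not call PagerDuty or ServiceNow; both systems are stood in for by plain dictionaries, and the `IncidentRecord` fields, the `sync` function, and the "most recently updated copy wins" rule are all assumptions made for illustration. A real integration would need to use each vendor's own API and handle conflicts more carefully.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IncidentRecord:
    """One incident as seen by one system. All field names are hypothetical."""
    incident_id: str
    status: str        # e.g. "triggered", "acknowledged", "resolved"
    summary: str
    updated_at: datetime


def sync(platform: dict[str, IncidentRecord],
         system_of_record: dict[str, IncidentRecord]) -> None:
    """Bring both stores to the same state, in place.

    Assumed rule: for each incident the copy with the later updated_at wins
    (the platform copy wins a tie), and an incident that exists on only one
    side is copied across to the other.
    """
    # The union of keys is a new set, so the dicts can be updated in the loop.
    for incident_id in platform.keys() | system_of_record.keys():
        a = platform.get(incident_id)
        b = system_of_record.get(incident_id)
        if a is None:
            newest = b
        elif b is None:
            newest = a
        else:
            newest = a if a.updated_at >= b.updated_at else b
        platform[incident_id] = newest
        system_of_record[incident_id] = newest


# Usage with made-up incidents: INC-1 was acknowledged on the platform after
# it was logged, and INC-2 so far exists only in the system of record.
t0 = datetime(2022, 8, 30, 9, 0, tzinfo=timezone.utc)
t1 = datetime(2022, 8, 30, 9, 5, tzinfo=timezone.utc)
platform = {"INC-1": IncidentRecord("INC-1", "acknowledged", "Checkout errors", t1)}
record = {
    "INC-1": IncidentRecord("INC-1", "triggered", "Checkout errors", t0),
    "INC-2": IncidentRecord("INC-2", "resolved", "Slow search", t0),
}
sync(platform, record)
```

After the call, both dictionaries hold the same two incidents, with INC-1 showing the newer "acknowledged" status on both sides. The point is that responders only update one system, and the other catches up without manual copying.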

In You Build It You Run It, the first on-call engineer who responds to an incident automatically becomes the incident commander. You can configure an on-call rota for your incident managers in your incident response platform, similar to a product team. This allows an incident commander to declare a major incident and contact an incident manager at the click of a button. In PagerDuty, this is known as a response play, and it means incident managers can easily coordinate between teams and manage stakeholder communications as necessary.
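The flow described above can be sketched as a small in-memory model. This is not PagerDuty's API or how a response play is implemented; the `Incident` class, its method names, the engineer names, and the rule that only the incident commander can declare a major incident are hypothetical, chosen to illustrate the two ideas: the first responder becomes incident commander, and declaring a major incident pages whoever is first on the incident manager rota.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Incident:
    """A hypothetical incident in an incident response platform."""
    service: str
    commander: Optional[str] = None
    incident_manager: Optional[str] = None
    major: bool = False
    paged: list[str] = field(default_factory=list)

    def acknowledge(self, engineer: str) -> None:
        """The first engineer to respond becomes incident commander."""
        if self.commander is None:
            self.commander = engineer

    def declare_major(self, declared_by: str, manager_rota: list[str]) -> str:
        """Mark the incident as major and page the on-call incident manager.

        Assumptions: only the incident commander may declare, and the first
        entry in manager_rota is the incident manager currently on call.
        """
        if declared_by != self.commander:
            raise PermissionError(
                "only the incident commander can declare a major incident"
            )
        if not manager_rota:
            raise ValueError("incident manager rota is empty")
        self.major = True
        self.incident_manager = manager_rota[0]
        self.paged.append(self.incident_manager)
        return self.incident_manager


# Usage with made-up names: Asha responds first, so she stays commander even
# after Ben also responds. She then declares a major incident.
incident = Incident(service="checkout")
incident.acknowledge("asha")
incident.acknowledge("ben")
manager = incident.declare_major("asha", manager_rota=["carol", "dev"])
```

Here `manager` is "carol", the first name on the rota, and she is recorded as paged. Keeping the incident manager rota in the same platform as the product team rotas is what makes the hand-off a single action for the incident commander.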

Once your product teams are aligned with your major incident management process, you’ll have a RACI that looks like this:

Aligning your You Build It You Run It product teams with your major incident management process is vital. It produces a repeatable, reliable incident management process used for digital services and foundational systems, which complies with your internal regulations and maximises the effectiveness of your incident response efforts.

To find out more, you can continue our You Build It You Run It pitfalls series below:

  1. 7 pitfalls to avoid with You Build It You Run It
  2. 5 ways to minimise your run costs with You Build It You Run It
  3. Why your head of operations shouldn’t be accountable for digital reliability
  4. How to manage BAU in product teams
  5. 4 ways to remove the treacle in change management
  6. Why product teams still need major incident management – you are here!
  7. Stop trying to embed specialists in every product team 
  8. How to avoid developer burnout on call

Our You Build It You Run It page has loads of resources on on-call product teams – case studies, conference talks, in-depth articles, and more. Plus, our You Build It You Run It playbook gives you a deep dive into how to make it happen! Get in touch, and let us know what you think.