How to measure product delivery success

At Equal Experts, we’re frequently asked about success measures for product delivery. It can be hard to figure out what to measure – and what not to measure!

We often find ourselves working within multiple teams that share responsibility for one product. For example, an ecommerce organisation might have Equal Experts consultants embedded in a product team, a delivery team, and an operations team, all working on the same payments service.

When we’re asked to improve collaboration between interdependent teams, we look at long-term and short-term options. In the long-term, we advocate moving to cross-functional product delivery teams. In the short-term, we recommend establishing shared success measures for interdependent teams.

By default, we favour measuring these shared outcomes: 

  • High profitability. A low cost of customer acquisition and a high customer lifetime value.
  • High throughput. A high deployment frequency and a low deployment lead time.
  • High quality. A low rework time percentage.
  • High availability. A high availability rate and a low time to restore availability.

If your organisation is a not-for-profit or in the public sector, we’d look at customer impact instead of profitability. Likewise, if you’re building a desktop application, we’d change the availability measures to user installer errors and user session errors.

These measures have caveats. Quantitative data is inherently shallow, and it’s best used to pinpoint where the right conversations need to happen between and within teams. What “high” and “low” mean is specific to the context of your organisation. And these measures are harder to implement than story points or incident count – yet they’re still the right thing to do.

Beware per-team incentives

‘Tell me how you measure me and I will tell you how I will behave’ – Eli Goldratt

People behave according to how they’re measured. When interdependent teams have their own measures of success, people are incentivised to work at cross-purposes. Collaboration becomes a painful and time-consuming process, and there’s a negative impact on the flow of product features to customers. 

At our ecommerce organisation, the product team wants an increase in customer page views. The delivery team wants to complete more story points. The operations team wants a lower incident count. 

This encourages the delivery team to maximise deployments, thereby increasing its story points, and the operations team to minimise deployments, thereby decreasing its incident count. These conflicting behaviours don’t happen because of bad intentions. They happen because there’s no shared definition of success, so each team invents its own.

Measure shared outcomes, not team outputs

All too often, teams are measured on their own outputs. Examples include story points, test coverage, defect count, incident count, and person-hours. Team outputs are poor measurement choices. They’re unrelated to customer value-add, and offer limited information. They’re vulnerable to inaccurate reporting, because they’re localised to one team. Their advantage is their ease of implementation, which contributes to their popularity.

We want to measure shared outcomes of product delivery success. Shared outcomes are tied to customers receiving value-add. They encode rich information about different activities in different teams. They have some protection against bias and inaccuracies, as they’re spread across multiple teams.   

When working within multiple teams responsible for the same product, we recommend removing any per-team measures, and measuring shared outcomes instead. This aligns incentives across teams, and removes collaboration pain points. It starts with a shared definition of product delivery success.

Define what success means

When we’re looking at inter-team collaboration, we start by jointly designing with our client what delivery success looks like for the product. We consider whether we’re building the right product as well as building the product right, as both are vital. We immerse ourselves in the organisational context. A for-profit ecommerce business will have a very different measure of success than a not-for-profit charity in the education sector. 

We measure an intangible like “product delivery success” with a clarification chain. In How To Measure Anything, Douglas Hubbard defines a clarification chain as a short series of connected measures representing a single concept. The default we recommend to clients is:

product delivery success includes high profitability, high throughput, high quality, and high availability

In our ecommerce organisation, this means the product team, delivery team, and operations team would all share the same measures, tied to one definition of product delivery success.

These are intangibles as well, so we break them down into their constituent measures.

Pick the right success measures

It’s important to track the right success measures for your product. Don’t pick too many, don’t pick too few, and don’t set impossible targets. Incrementally build towards product delivery success, and periodically reflect on your progress.

Profitability can be measured with cost of customer acquisition and customer lifetime value. Cost of customer acquisition is your sales and marketing expenses divided by your number of new customers. Customer lifetime value is the total worth of a customer while they use your products. 

Throughput can be measured with deployment frequency and deployment lead time. Deployment frequency is the rate of production deployments. Deployment lead time is the days between a code commit and its consequent production deployment. These measures are based on the work of Dr. Nicole Forsgren et al in Accelerate, and a multi-year study of Continuous Delivery adoption in thousands of organisations. They can be automated.
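
As an illustration, here is a minimal sketch of how these two measures could be derived from deployment records, assuming your pipeline can supply a commit timestamp and a deployment timestamp for each production deployment (the Deployment type and function names are hypothetical):

package main

import (
  "fmt"
  "time"
)

// Deployment is a hypothetical record of one production deployment,
// pairing the originating commit timestamp with the deployment timestamp.
type Deployment struct {
  CommittedAt time.Time
  DeployedAt  time.Time
}

// deploymentFrequency returns production deployments per day over an observed period.
func deploymentFrequency(deploys []Deployment, period time.Duration) float64 {
  return float64(len(deploys)) / (period.Hours() / 24)
}

// meanLeadTimeDays returns the average days between a code commit and its production deployment.
func meanLeadTimeDays(deploys []Deployment) float64 {
  var total time.Duration
  for _, d := range deploys {
    total += d.DeployedAt.Sub(d.CommittedAt)
  }
  return (total.Hours() / 24) / float64(len(deploys))
}

func main() {
  now := time.Now()
  deploys := []Deployment{
    {CommittedAt: now.Add(-26 * time.Hour), DeployedAt: now.Add(-24 * time.Hour)},
    {CommittedAt: now.Add(-50 * time.Hour), DeployedAt: now.Add(-2 * time.Hour)},
  }
  fmt.Printf("Deployment frequency: %.1f per day\n", deploymentFrequency(deploys, 7*24*time.Hour))
  fmt.Printf("Mean deployment lead time: %.1f days\n", meanLeadTimeDays(deploys))
}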

Quality can be measured with rework time percentage. It’s the percentage of developer time spent fixing code review feedback, broken builds, test failures, live issues, etc. Quality is hard to define, yet we can link higher quality to lower levels of unplanned fix work. In Accelerate, Dr. Forsgren et al found a statistically significant relationship between Continuous Delivery and lower levels of unplanned fix work. Rework time percentage is not easily automated, and a monthly survey of developer effort estimates is a pragmatic approach.

Availability can be measured using availability rate and time to restore availability. The availability rate is the percentage of requests successfully completed by the service, and is linked to an availability target such as 99.0% or 99.9%. The time to restore availability is the minutes between a lost availability target and its subsequent restoration.
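
For illustration, a minimal sketch of both availability calculations, assuming you already collect request success counts and incident timestamps (the function names and numbers are hypothetical):

package main

import (
  "fmt"
  "time"
)

// availabilityRate returns the percentage of requests successfully completed by the service.
func availabilityRate(successfulRequests, totalRequests int) float64 {
  return float64(successfulRequests) / float64(totalRequests) * 100
}

// timeToRestoreMinutes returns the minutes between a lost availability target
// and its subsequent restoration.
func timeToRestoreMinutes(lostAt, restoredAt time.Time) float64 {
  return restoredAt.Sub(lostAt).Minutes()
}

func main() {
  fmt.Printf("Availability rate: %.2f%%\n", availabilityRate(998550, 1000000))
  lost := time.Date(2021, 3, 1, 14, 0, 0, 0, time.UTC)
  restored := time.Date(2021, 3, 1, 14, 42, 0, 0, time.UTC)
  fmt.Printf("Time to restore: %.0f minutes\n", timeToRestoreMinutes(lost, restored))
}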

In our experience, these measures give you an accurate picture of product delivery success. They align incentives for interdependent teams, and encourage people to all work in the same direction. 

If your organisation is a not-for-profit or in the public sector, we’d look at customer impact instead of profitability. Likewise, if you’re building a desktop application, we’d change the availability measures to user installer errors and user session errors.

Measuring shared outcomes rather than team outputs makes collaboration much easier for interdependent teams, and increases the chances of product delivery success. It’s also an effective way of managing delivery assurance. If you’d like some advice on how to accomplish this in your own organisation, get in touch using the form below and we’ll be delighted to help you.

 

We – Steve Smith and Ali Asad Lotia – are the Heads of Operability at Equal Experts (EE). We’d like to set out EE’s position on Site Reliability Engineering (SRE).

We’ll recommend the bits you should try in your organisation, mention some bits you (probably) shouldn’t try, and explain how SRE is linked to operability.  

If you’re in a rush, the EE position on SRE is:

  • Try availability targets, request success rate measurements, Four Golden Signals, SLIs, and SLOs.
  • Maybe try an SRE advocacy team.
  • Don’t try error budgets or an SRE on-call team.

And regardless of SRE, do try putting your delivery teams on call. This is better known as You Build It You Run It.

Introduction

In 2004, Ben Treynor Sloss started an initiative within Google to improve the reliability of its distributed services. He advocated for reliability as a software feature, with developers automating tasks traditionally owned by operations teams. The initiative was called SRE, and it’s become widely known in recent years. 

In Site Reliability Engineering by Betsy Beyer et al, the authors set the scene for SRE by answering the question “why can’t I have 100% reliability?”:

  • 100% can’t happen, because your user experience is always limited by your device (your wifi or 4G connection isn’t 100% reliable).
  • 100% shouldn’t be attempted, because maximising availability limits your speed of feature delivery, and increases operational costs.

In The Site Reliability Workbook by Betsy Beyer et al, Andrew Clay Shafer talks about reliability at scale, and says, ‘I know DevOps when I see it and I see SRE at Google, in theory and practice, as one of the most advanced implementations’.

Back in 2017, our CEO Thomas Granier explained why DevOps is just a conversation starter at EE. We both believe SRE is a conversation starter as well. It’s an overloaded concept. Phrases such as “SRE practice” and “SRE team” can be really confusing. Within EE, those terms have been clarified to reduce confusion.

The bits of SRE you should try

Based on our experiences, both of us recommend you try these SRE practices:

  • Availability targets. Calculate an availability level based on downtime cost, downtime tolerance, and engineering time, to set clear expectations of availability. 
  • Four Golden Signals. Focus dashboards on throughput, error rate, latency, and saturation, so operating conditions are easier to understand.
  • Service Level Indicators (SLIs). Visualise targets for availability, latency, etc. on dashboards, so operational tolerances can be watched.
  • Service Level Objectives (SLOs). Implement targets for availability, latency, etc. as production alerts, so abnormal conditions are easily identified. 

Don’t try them all at once! Run some small experiments, collect some feedback, and then adjust your approach. Availability targets are a good starting point, and Site Reliability Engineering lays out an excellent approach. 

An availability target is chosen by a product manager, from a set of availability levels. First, a product manager estimates their downtime cost, based on the revenue and reputational damage of downtime. That cost is then matched to a balance between maximum tolerable downtime and required engineering time. 

 

Engineering time stems from a valuable insight from Betsy Beyer et al:

‘Each additional nine corresponds to an order of magnitude improvement toward 100% availability’

This is a powerful heuristic you can use to reason about availability targets. Like all heuristics, it’s enough for a short-term approximation that won’t be perfect. Engineering effort will always vary by service complexity. 

For example, a delivery team owns a service with synchronous dependency calls. They spend three days on operational features, to harden the service until it reaches 99.0% availability. For the exact same team and exact same service, it would take up to 30 days to reach 99.9%, maybe by adding blue-green deployments and caching dependency calls. It would take 300 days for 99.99%, perhaps by reengineering dependency calls to be asynchronous and replacing the service runtime. The product manager would have to balance availability needs against three days, one month, or ten months of effort.
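
To make the heuristic concrete, here is a small sketch that turns each availability level into a downtime budget over a 30-day window, alongside the relative engineering effort implied by the order-of-magnitude rule (the 1x/10x/100x multipliers are illustrative assumptions, not prescriptions):

package main

import "fmt"

func main() {
  const windowHours = 30 * 24 // a 30-day window

  // Each availability level is paired with the relative engineering effort
  // implied by the order-of-magnitude heuristic. Illustrative only.
  levels := []struct {
    availability float64
    effort       string
  }{
    {99.0, "1x"},
    {99.9, "10x"},
    {99.99, "100x"},
  }

  for _, l := range levels {
    downtimeMinutes := (100 - l.availability) / 100 * windowHours * 60
    fmt.Printf("%.2f%% availability: ~%.0f minutes of downtime per 30 days, ~%s engineering effort\n",
      l.availability, downtimeMinutes, l.effort)
  }
}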

The bits of SRE you shouldn’t try

EE consultants strive to advise organisations on what not to try, as well as what to try. We both believe you should (probably) skip these SRE practices:

  1. Error budgets. Turning tolerable downtime into a budget for deployments, and halting deployments for remediation work if too many errors occur.
  2. SRE on-call team. Using a central delivery team of SRE developers to support services with critical traffic levels via error budgets, while other services have delivery teams on call. 

These aren’t bad ideas. They’re expensive ideas. They require cultural and technology changes that take at least an order of magnitude longer than other SRE practices. We’d only consider an SRE on-call team over You Build It You Run It if an organisation had services with an ultra high downtime cost, relative to its other services. Then a 99.99% availability target and up to 100x more engineering time might be justifiable. 

We’ve used the above availability table in private and public sector organisations. We’ve asked product managers to choose availability levels based on downtime costs, their personal tolerances for downtime, and engineering time. We’ve not seen a product manager choose more than 99.9% availability and 10x engineering time. None of them anticipated a downtime cost that warranted 99.99% availability and up to 100x more engineering time. 

EE doesn’t recommend an SRE on-call team, because it’s simpler and more cost effective to put delivery teams on call. 

There’s a common misconception you can rebadge an existing operations team as an SRE on-call team, or an SRE advocacy team. Both of us have repeatedly advised organisations against this. Aside from the expensive cultural and technology challenges linked to both types of SRE team, adopting SRE principles requires software engineering skills in infrastructure and service management. That expertise is usually absent in operations teams. 

For a given service, we both believe the valid production support options range from putting the delivery team on call (You Build It You Run It) to, in rare ultra-high availability cases, an SRE on-call team.

It’s all about operability

In 2017, our colleague Dan Mitchell talked about operability as the value-add inside DevOps. Dan described operability as ‘the operational requirements we deliver to ensure our software runs in production as desired’. He mentioned techniques such as automated infrastructure, telemetry, deployment health, on-call delivery teams, and post-incident reviews.

Operability is a key enabler of Continuous Delivery. Continuous Delivery is about improving your time to market. A super-fast deployment pipeline won’t be much help if your production services can’t be operated safely and reliably. EE helps organisations to build operability into services, to increase their availability and their ability to cope with failures.

Operability is the value-add inside SRE.

The SRE practice of availability targets is an effective way for an organisation to genuinely understand its availability needs and downtime tolerances. Common definitions of availability and downtime need to be established, along with a recognition that planned downtime sits outside of downtime tolerances. This may impact the architecture of different services, as well as patching and upgrade processes.

Four Golden Signals, SLIs, and SLOs are a great way to improve your ability to cope with failures. Per-service dashboards tied to well-understood characteristics, and per-service alerts tied to availability targets, can provide actionable, timely data on abnormal operating conditions. 
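
As a simplified sketch of how an SLO becomes an alert (not a production alerting setup), the core check is whether a measured SLI has dropped below its target for a given window; the names and numbers below are hypothetical:

package main

import "fmt"

// sloBreached reports whether the measured request success rate (the SLI)
// has dropped below its target (the SLO) over a monitoring window.
func sloBreached(successfulRequests, totalRequests int, targetPercent float64) bool {
  sli := float64(successfulRequests) / float64(totalRequests) * 100
  return sli < targetPercent
}

func main() {
  // Hypothetical five-minute window: 9,950 of 10,000 requests succeeded.
  if sloBreached(9950, 10000, 99.9) {
    fmt.Println("ALERT: availability SLO breached in the last window")
  }
}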

For example, Steve recently worked with an enterprise organisation to introduce availability targets, SLO alerts, and You Build It You Run It to their 30 delivery teams and £2B revenue website. In the first year, this was 14x cheaper on support costs, 3x faster on incident response time, and 4x more effective on revenue protection. SRE was hardly mentioned.

If your organisation has a few delivery teams, we’d expect them to adopt operability practices for themselves. If you have delivery teams at scale, you might consider an SRE advocacy team, as Jennifer Strejevitch describes in how to be effective as a small SRE practice. We’ve done something similar with Digital Platform Enablement teams, as described in our Digital Platform playbook.

Summary

SRE is a real pick and mix. We believe some of its practices are really good. You should try them, to progress towards Continuous Delivery and operability. We also see some ideas that you (probably) shouldn’t try. 

The EE position on SRE is:

  • Do try availability targets, request success rate measurements, Four Golden Signals, SLIs, and SLOs (and don’t call it SRE if you don’t want to).
  • Maybe try an SRE advocacy team (if you have delivery teams at scale).
  • Don’t try error budgets or an SRE on-call team (unless you genuinely need 99.99% availability).

And with or without SRE terminology:

  • Do try putting your delivery teams on call, to increase service deployments and improve production reliability.

If you’d like some advice on SRE, Continuous Delivery, or operability, get in touch using the form below and we’ll be delighted to help you.

 

Embarking on building a Digital Platform can be a rewarding experience for your organisation. It will mean you can benefit from faster innovation, higher quality (with improved reliability), reduced costs, and happy people. But it’s no small undertaking.

So, before embarking on a project, it’s important to understand the benefits and assess whether they align with your digital strategy. Here we unpack the main benefits of a Digital Platform, to help you decide whether it is right for you.

There are many benefits, so we have broken them down into the following sections. 

A Digital Platform allows faster innovation

  • Faster time to launch. Automating and abstracting cloud setup and simplifying governance processes means a new Digital Service can be launched to customers within days.
  • Frequent updates. Creating an optimal deployment pipeline allows customer experiments in a Digital Service to be updated on at least a daily basis.
  • Increased focus on business problems. Institutionalising new policies that cross-cut departments means uncoordinated and/or duplicated processes can be eliminated, and people can focus on higher-value work.
  • More business model opportunities. Friction-free, rapid launches of Digital Services allow an organisation to separate its differentiating business functions from utilities and to quickly trial different business models in new marketplaces.

It invariably provides a higher quality solution

  • Fewer environmental issues. Automating configuration and infrastructure lowers the potential for environment-specific problems.
  • More deterministic test results. Centralising automated test executors reduces opportunities for nondeterminism in test suites.
  • Faster rollback. Creating an effective rollback system with health checks means deployment failures can be fixed quickly.

You will benefit from increased reliability

  • More operable services. Providing logging, monitoring, and alerting out of the box increases the operability of Digital Services, and helps users to quickly discern abnormal operating conditions.
  • Graceful degradation. Implementing circuit breakers and bulkheads on the wire for third-party systems allows Digital Services to gracefully degrade on failure.
  • Improved business continuity. Automating the entire platform infrastructure in the cloud creates new business continuity options.

Improved ways of working

  • Policy experimentation. Cutting across departments means new policies can be forged in inceptions, Chaos Day testing, secure delivery, and more. 
  • Drive new practices. Creating enabling constraints in user journeys can drive the adoption of new practices, such as restricting shared libraries to encourage decoupled domains for Digital Services.
  • Simpler processes. Establishing meaningful Service Level Objectives with an automated alerting toolchain can make You Build It You Run It production support easier to set up.

Take advantage of the most advanced technology

  • Use the best available technologies. Standardising cloud building blocks means the best available technology stack can be provided to Digital Service teams.
  • Traffic optimisations. Surfacing self-service, elastic infrastructure means Digital Service teams can easily optimise for fluctuating traffic patterns without significant costs.
  • Zero downtime updates. Consolidating service runtimes means functional updates can be continually applied with zero downtime for Digital Services.

Benefitting from reduced costs

  • Economies of scale. Centralising the Digital Service lifecycle means economies of scale can be achieved, as more Digital Service teams can be added without incurring repeat buy/build costs.
  • Easier cost management. Centralising self-service touchpoints for automated infrastructure allows infrastructure costs to be visualised and closely managed. 
  • Improved security. Positioning security specialists in the Digital Platform teams means security threats can be more easily identified and Digital Services can quickly receive security updates. 

Ultimately you will have happier, more productive people

  • Lower cognitive load. Abstracting away the Digital Service lifecycle reduces your staff’s cognitive load, reducing lead times to less than 24 hours for a new joiner, a mover between teams, a leaver, or a new Digital Service team.
  • Easier to identify talent needs. Splitting business domains into Digital Services helps to highlight which domains are true business differentiators and require the most talented engineers.
  • Increased talent attractors. Using the latest cloud technology on Digital Platform and Digital Service teams will encourage talented engineers to join your organisation.
  • More recruitment options. Concentrating specialist skills in Digital Platform teams means recruitment efforts for Digital Service teams can focus on onshore/offshore developers, testers, etc. without requiring more costly, specialised cloud skills.

Contact us! 

We hope you find this useful. For more information about Digital Platforms take a look at our Digital Platform Playbook. We thrive on feedback and welcome contributions. As you can see, we love building digital platforms! If you’d like us to share our experience with you, get in touch using the form below.

 

We are often asked by our clients when is a good time to start building a Digital Platform. To help answer this question, we’ve established minimum criteria that need to be met before funding is allocated and development work begins.

We recommend you revisit these criteria once a quarter in your first year, and once a year after that. This will help you to understand the target architecture of your Digital Platform, and continuously validate the vision for your Digital Platform.

  1. Multi-year funding
  2. Homogeneous workload
  3. At least one Digital Service team at outset
  4. Empowered teams
  5. Potential for five Digital Service teams

You need to be able to commit to multi-year funding

A Digital Platform is a significant investment. It’s a strategic asset rather than a cost-cutting liability. It’s funded as a product. 

Multi-year funding is a positive signal of a commitment to continuous improvement. Without that commitment, your Digital Platform teams will not be able to redesign platform capabilities to satisfy changing user needs, or leverage new commodity cloud services to reduce costs.

You need a homogeneous workload

A Digital Platform is based on a homogeneous workload, created by multiple Digital Services. If different Digital Services have heterogeneous workloads, your Digital Platform teams will be slower to deliver new features. They will have to seek consensus between different Digital Service teams on which platform capabilities need to be enhanced. The user experience for Digital Service teams will be diminished.

For example, a Digital Platform could support Kotlin microservices and React frontends. A team might ask for data pipelines to be supported as an additional workload type, for a one-off Digital Service. That request would be politely declined by the Digital Platform teams, and there would be a collaborative effort to find an alternative solution outside the Digital Platform. 

You need at least one Digital Service team from the outset

A Digital Platform starts with a minimum of one Digital Platform team and one Digital Service team. This means the first bi-directional feedback loop can be established between teams, and the initial platform features can be quickly validated. 

Your first Digital Service team needs to have completed its inception phase. This ensures the Digital Service workload is sufficiently well understood to begin construction of the Digital Platform. Otherwise, the delivery of new platform features will be slowed down, due to the rework needed to focus on a different workload type. 

A Digital Platform team that starts out without a Digital Service team will fall into the Premature Digital Platform Team pitfall.

You need empowered teams

A Digital Platform exists in an ecosystem in which Digital Platform teams are free to make their own technology choices. They need to work independently of any pre-approved tools, so they can experiment with new technologies that meet the particular needs of the Digital Service teams. 

In a similar vein, Digital Service teams have freedom within the Digital Platform ecosystem. The Digital Platform teams build platform capabilities with sensible defaults, and Digital Service teams can configure them as necessary. 

There needs to be some pragmatism. Digital Platform and Digital Service teams need to consider pre-existing tools when exploring problems. However, the people best suited to make decisions are those closest to the work, and they must not be beholden to an old list of ill-suited technologies. 

There should be potential for five Digital Service teams

A Digital Platform has multi-year funding linked to a recognition that at least five Digital Service teams are likely to exist in the future. In other words, there needs to be sufficient product demand for at least five distinct Digital Services within your organisation. From our experience of building Digital Platforms with multiple organisations, we believe this is the tipping point at which strategically investing in a Digital Platform is beneficial.

If there is zero potential for five or more Digital Service teams, we don’t believe a Digital Platform is the right approach. You won’t achieve the economies of scale needed to justify the multi-year funding. A better approach would be to invest funding and resources directly into your handful of teams, ensuring they can build and operate their services.

Contact us!

We hope you find this useful. For more information about Digital Platforms take a look at our Digital Platform Playbook. We thrive on feedback and welcome contributions. As you can see, we love building digital platforms! If you’d like us to share our experience with you, get in touch using the form below.

A Digital Platform allows your organisation to accelerate its time to market, increase revenue, reduce costs, and create innovative products for your customers.

Equal Experts sees Digital Platforms as an essential part of the IT landscape across both the public and private sectors. Read on to understand what a Digital Platform is, and how it empowers Digital Services across an organisation.

At Equal Experts, we define a Digital Platform as:

A Digital Platform is a bespoke Platform as a Service (PaaS) product composed of people, processes, and tools, that enables teams to rapidly develop, iterate, and operate Digital Services at scale. 

A Digital Platform is a powerful tool and when used correctly it is:

  • Differentiating. It empowers your teams to concentrate on solving real business problems by abstracting away organisational complexities and toil.
  • A product. It’s built incrementally by incorporating feedback from your teams. It accelerates the delivery of Digital Services. It’s enduring.
  • Opinionated. It makes it easy for your teams to build, deploy, and operate Digital Services by providing a curated set of high-quality building blocks.

It’s also important to understand what a Digital Platform is not:

  • Not a commodity. It cannot be bought off the shelf, as it must satisfy the specific needs of your organisation. It’s built by weaving together open-source, commodity, and bespoke tools to create a technology accelerator.
  • Not a project. It isn’t a one-off development with a fixed end date. It needs to keep changing, as the needs of your teams will change based on their customers’ demands.
  • Not a universal infrastructure platform. It cannot run all cloud services for all possible consumers without weakening the proposition. It needs to focus on a subset of cloud services to support Digital Service workloads.

It’s important to remember that a Digital Platform isn’t a silver bullet. It’s a long-term commitment to Digital Services at scale. It’s not appropriate for all workloads, teams, or organisations. For more on this, see when to start a Digital Platform.

A Digital Service is a software service designed to fulfil a product capability and run on a Digital Platform. Such a service might be a monolith, or composed of multiple microservices. It’s usually based on modern software development principles, such as 12 Factor or Secure Delivery. It’s owned by a single Digital Service team responsible for understanding its customers, and producing a service that meets their needs.

As a good example, here’s a services diagram of a fictional Digital Platform in a retail organisation. It shows eight Digital Services in development within two different retail domains, as well as six platform capabilities within the Digital Platform itself.


Fig 1: Digital Services on a Digital Platform

Bespoke

A Digital Platform is bespoke. It’s something unique, built solely for the Digital Service teams in your organisation. It’s founded on custom building blocks made by your Digital Platform teams, and commodity cloud services from your public cloud. It’s about people, processes, and tools coming together to form platform capabilities. A public cloud can’t provide you with a Digital Platform out of the box. Nor can an off-the-shelf product from a vendor. But there are many advantages and opportunities that come with a public cloud as a foundation for a Digital Platform.

Paved Road

A Digital Platform is a set of Paved Roads. Each Paved Road consists of low-friction, hardened interfaces that comprise user journeys for Digital Service teams (e.g. build a service, deploy a service, or service alerts). Those paved user journeys are fully automated and encompass the learned best practices specific to your organisation. 

A Paved Road is built incrementally by Digital Platform teams. Each platform capability is delivered in small increments, and adjustments are made based on user feedback. Over time, as each platform capability becomes more opinionated, the Paved Road becomes wider and longer. Enabling constraints are used to encourage frequent production deployments and high standards of reliability for long-lived Digital Services.

A Paved Road eliminates common failure modes, by automating repetitive tasks. It encourages the adoption of Continuous Delivery and Operability practices, such as constant monitoring of live traffic, and steers away from pitfalls such as End-To-End Testing. It challenges Digital Service teams to rethink how they approach particular problems, and contribute enhancements and features back into the Paved Road experience. 

Bi-directional feedback

A Digital Platform is primarily about the people who build it and use it. It exists to satisfy its users’ needs, through technical or non-technical means. The value of its capabilities is derived from the ability of its Digital Platform teams to talk to and learn from its Digital Service teams. It’s the responsibility of the Digital Platform teams to create an ecosystem of bi-directional feedback loops. User feedback allows Digital Platform teams to better understand which technology building block or organisational process needs to be improved, and industrialised so that all teams can benefit. 

For example, feedback from your Digital Service teams might include complaints about a historical, time-consuming change-approvals process in your organisation owned by an overworked change management team. Your Digital Platform needs to provide an automated deployment pipeline that acts as an automated audit trail. If your Digital Platform teams can present a live audit trail that reduces toil for the change management team, their needs might be met by a streamlined, self-service process, in which Digital Service teams peer-review their own change requests.

Contact us!

We hope you find this useful. For more information about Digital Platforms take a look at our Digital Platform Playbook. We thrive on feedback and welcome contributions. As you can see, we love building digital platforms! If you’d like us to share our experience with you, get in touch using the form below.

A Digital Platform optimised for the delivery of Digital Services can be an accelerator for your organisation. 

The Equal Experts Digital Platform playbook is our thinking on why, when, and how to build Digital Platforms. We have found that, under the right circumstances, introducing a Digital Platform enables an organisation to achieve Continuous Delivery and Operability at scale.

Our approach is based on first-hand experience of building Digital Platforms in a wide range of domains, such as Government, Financial Services, Retail, and Utilities, and on our deep expertise in helping organisations adopt Continuous Delivery and Operability principles and practices.

To be competitive, your organisation must rapidly explore new product offerings as well as exploit established products. New ideas must be validated and refined with customers as quickly as possible if product/market fit and repeatable success are to be found.

You might have multiple teams in a brownfield or greenfield IT estate, where your ability to deliver product features is constrained by your technology capabilities. In either scenario, a Digital Platform optimised for the delivery of Digital Services can be an accelerator for your organisation – if you can make a multi-year commitment to investment. A Digital Platform isn’t a small undertaking and requires ongoing funding for you to realise the greatest benefits.

Who is this playbook for?

We’ve created this playbook to help you and your colleagues build a Digital Platform together. It’s for everyone in your organisation, not just software developers or operability engineers. That includes CIOs, CTOs, product managers, analysts, delivery leads, engineering managers, and more.

We’re strong proponents of cloud-native computing, serverless in all its forms, microservice architectures, and open-source technologies. However, the practices defined in our playbook are technology and vendor-agnostic, to allow you to determine the best way to adopt these ideas in the context of your organisation.

What it is about

The playbook is a game plan, in the sense that it is not a recipe for a single activity but an orchestration of a number of ideas that together make up a successful Digital Platform. It touches on topics such as what a Digital Platform is, its capabilities and benefits, and when to start building one. We recommend principles to adopt, and we outline the practices and pitfalls we’ve identified along the way.

Contact us! 

We hope you find this and our other playbooks useful. We thrive on feedback and welcome contributions. As you can see, we love building digital platforms! If you’d like us to share our experience with you, get in touch using the form below.

Over the last few weeks, I took a deep dive into Infrastructure as Code (IaC). I chose AWS and Terraform to write my provisioning scripts.

It came naturally to me, as a software engineer, to write Terraform code. A lot of software design principles (like KISS, DRY, or even SOLID to some extent) can be adapted to write quality IaC. The intention is to end up with small, decoupled modules, used as building blocks for provisioning environments. Still, I felt a bit uncomfortable without TDD or any automated tests at all. Are automated IaC tests useful (besides improving my well-being)?

Whether we call ourselves developers, DevOps engineers, or software engineers, we always verify that our code works as expected, even if not in an automated manner. In the case of Terraform, running terraform plan (a command that creates an execution plan, but doesn’t deploy anything) checks the syntax and verifies that all resources are correctly defined. 

Manual functional testing involves logging into the management console and verifying the properties of the deployed resources, checking access-control lists, running connectivity tests, and so on. This is time-consuming and cumbersome, but necessary.

Operability practices aim to support frequent deployments. This also means constant changes to the underlying infrastructure. In this situation, manual testing of the IaC is inefficient and may not add as much value as expected. For this reason, I decided to take some time to evaluate the automated testing tools for IaC. Below are three valuable lessons I learned.

1. Testing modules with Terratest isn’t even close to unit testing.

The tool of choice for automated Terraform tests is Gruntwork’s Terratest. It is a Golang framework that is actively developed and growing in popularity.

In the beginning, it was tempting to think about module tests as if they were unit tests. When you unit-test a function used in your application, you don’t need to run the application. The tests are short, simple, and examine a particular piece of code in isolation (including input values that should yield errors). Correctness means that the output of the function under test is as expected. We care about broad test coverage.

Module testing in Terratest is different. You write example infrastructure code using the module you want to verify. Terratest deploys the code and runs your tests against it. The tests should answer the question: “does the infrastructure actually work?” For example, if your module deploys a server with a running application, you could send some traffic to it to verify that it responds as expected.
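
To make that concrete, here is a hedged sketch of what such a Terratest module test might look like. It assumes a hypothetical example configuration under examples/web-service that uses the module and exposes a url output; the exact helper signatures depend on your Terratest version:

package test

import (
  "testing"
  "time"

  http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
  "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestWebServiceModule(t *testing.T) {
  // Point Terratest at example code that uses the module under test.
  terraformOptions := &terraform.Options{
    TerraformDir: "../examples/web-service",
  }

  // Tear down the example infrastructure at the end of the test.
  defer terraform.Destroy(t, terraformOptions)

  // Deploy the example infrastructure.
  terraform.InitAndApply(t, terraformOptions)

  // Read the deployed service's URL from the Terraform outputs.
  url := terraform.Output(t, terraformOptions, "url")

  // Send some traffic to the service and verify it responds as expected.
  http_helper.HttpGetWithRetry(t, url, nil, 200, "OK", 30, 5*time.Second)
}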

Examining a resource’s properties (loaded from an API) is rarely practised with Terratest. It can be useful when false-negative test results would introduce a high level of risk. As a result, module testing with Terratest looks almost like end-to-end testing.

2. There are other tools to complement Terratest.

Sometimes, end-to-end tests are not enough. For example, suppose your private network accidentally has a route to the internet gateway. To confirm that the private network really is private, it’s convenient to check that no routes in its routing table allow public traffic. 

You could also picture a situation where the operations team lets the development teams create their own resources. You may need to ensure that the implemented code follows security standards and compliance obligations: e.g. all resources should be correctly tagged, all storage should be encrypted, some ports should never be open, etc.

In addition to Terratest, several other testing tools are more convenient for testing specific resource properties. One of them is terraform-compliance. You can encode all your policies in “spec” files, similar to Cucumber specs, and run them against the output of terraform plan.


Feature: Define AWS Security Groups
  In order to improve security
  As engineers
  We will use AWS Security Groups to control inbound traffic

  Scenario: Policy Structure
    Given I have AWS Security Group defined
    Then it must contain ingress

  Scenario: Only selected ports should be publicly open
    Given I have AWS Security Group defined
    When it contains ingress
    Then it must only have tcp protocol and port 22,443 for 0.0.0.0/0

This spec would yield an error if any of your security groups allow inbound traffic on a port other than 22 or 443.

If you feel more comfortable testing deployed resources, and you work with AWS, you could try AWSSpec. AWSSpec is built on top of Ruby’s RSpec testing framework. The tests are spec-like, in a BDD style. The difference is that you run them against the real infrastructure. Similarly to Terratest, if you want to test modules you need to deploy examples first. You could automate the deployment and verification using Test-Kitchen (along with the Kitchen-Terraform plugin). For example, testing a private subnet may look like this:

require 'spec_helper'
describe subnet('Default Private Subnet') do
  it { should exist }
  its(:cidr_block) { should eq '10.10.2.0/24' }
  its(:state) { should eq 'available' }
end
describe route_table('Private Subnet Route Table') do
  it { should exist }
  it { should have_subnet('Default Private Subnet') }
  it { should have_route('10.10.0.0/16').target(gateway: 'local') }
  it { should_not have_route('0.0.0.0/0')}
end

Executing the tests produces standard RSpec output, listing each expectation as passing or failing.

3. Automated IaC tests are expensive

The cost of IaC testing doesn’t only include the charges for the resources deployed for testing. Writing automated IaC tests requires good programming skills that may go beyond one programming language (Terratest uses Golang, terraform-compliance uses Python, AWSSpec uses Ruby, etc.).

Writing Terraform tests is time-consuming. The cloud APIs aren’t convenient to use, and helper libraries may be missing important functions. In the case of Terratest and AWSSpec, there is a lot of additional infrastructure code needed for module testing.

Many tools, although quite useful, aren’t yet mature. There is always a danger that they will cease to work with newer versions of Terraform or just be discontinued.

Summary

Should I recommend investing time and money into automated IaC testing? That depends. First of all, your team should focus on using Terraform the right way. This means no direct, manual changes to the infrastructure.

Once delivering IaC works well, your team may consider adding automated tests.

If a change may introduce a risk that can’t be accepted, then it’s a good candidate for an automated test. Another factor to consider is the team topology. If IaC ownership is decentralised, then automated tests may help to ensure code consistency, compliance, and quality.

Is it OK to give up automated IaC testing for now? If you can’t introduce automated IaC testing, you can rely on other confirmation techniques such as blue-green deployments with comprehensive monitoring. Although they are not substitutes for each other, both can help verify the correctness of the infrastructure.