Smoothing the Continuous Delivery Path: Identifying Vital Signs

In two previous articles in this series, I’ve looked at what I think are two vital elements of successful Continuous Delivery: a Production-focussed stand-up, and a Release Team.

While looking at the Release Team I touched upon what I’ll call the Vital Signs concept – and it’s a good example of how having a release team can lead to greater curiosity and concern for Production in general.

In short, knowing your production system’s vital signs can really help to improve your organisation’s Continuous Delivery. This article explores why this is – and how to find your system’s equivalent to a pulse rate.

It’s life, Jim, but not as we know it

A working production system is like a living person. Both are active systems with vital signs that indicate their health:

A person breathes, has a pulse and a temperature. If their pulse rate goes up significantly, you know they’re working harder. If their temperature rises and stays high, it’s normally a sign that they are unwell. A temporary spike in both pulse rate and temperature is quite normal, though, when they are highly stressed or active.
A working production system supports user transactions, makes calls to internal and external systems, logs errors and consumes resources (CPU, network, disk, etc.). A sudden rise in user transactions might lead to a rise in CPU usage, which is fine provided it stays below 100%. If CPU usage remains high in the absence of additional load, it may indicate a defect in the system.
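The last point, high CPU without a matching rise in load, can be sketched in a few lines of Python. This is only an illustration: the `Sample` type, the 80% CPU threshold and the 20% load tolerance are all assumptions of mine, and real values would have to come from watching your own system.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of two vital signs, taken at the same moment (hypothetical shape)."""
    transactions_per_min: float
    cpu_percent: float


def cpu_high_without_load(samples, baseline_tpm, cpu_threshold=80.0, load_tolerance=1.2):
    """Return True if every sample shows high CPU while load stays near the baseline.

    cpu_threshold and load_tolerance are illustrative defaults, not recommendations.
    """
    if not samples:
        return False
    return all(
        s.cpu_percent >= cpu_threshold
        and s.transactions_per_min <= baseline_tpm * load_tolerance
        for s in samples
    )


# CPU is hot but traffic is normal: worth investigating.
quiet_but_hot = [Sample(100, 92), Sample(105, 95), Sample(98, 90)]
# CPU is hot because traffic has tripled: the "temporary spike" case.
busy_and_hot = [Sample(300, 92), Sample(310, 95), Sample(305, 90)]

print(cpu_high_without_load(quiet_but_hot, baseline_tpm=100))
print(cpu_high_without_load(busy_and_hot, baseline_tpm=100))
```

Requiring every sample in the window to breach the threshold is what separates "rises and stays high" from a short spike.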

A patient undergoing an organ transplant has a medical team closely monitoring their vital signs, to guard against life-threatening changes. Production systems may not be in the operating theatre every day, but Continuous Delivery is somewhat akin to surgery (albeit hopefully of the keyhole variety!), in that a working system has its internals altered while it is still operational.

A good product team will therefore closely monitor its service’s vital signs before, during and after each release, to guard against any user-threatening changes (aka bugs).
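As a rough sketch of what comparing vital signs before and after a release might look like, the Python below flags any sign whose average has risen noticeably. The function name, the metric names and the 25% tolerance are hypothetical, and it assumes that a higher reading is worse for every sign passed in (true of error rates and response times, not of conversion rate) and that both dictionaries contain the same keys.

```python
from statistics import mean


def release_regressions(before, after, tolerance=0.25):
    """Return the names of vital signs whose mean rose by more than `tolerance`.

    `before` and `after` map a vital sign's name to a list of readings taken
    before and after the release. The 25% default is an arbitrary illustration.
    """
    flagged = []
    for name, baseline_values in before.items():
        baseline = mean(baseline_values)
        current = mean(after[name])
        if baseline == 0:
            # Any readings at all are a change when the baseline was zero.
            if current > 0:
                flagged.append(name)
        elif (current - baseline) / baseline > tolerance:
            flagged.append(name)
    return flagged


before = {"errors_per_min": [2, 3, 2], "response_ms": [120, 130, 125]}
after = {"errors_per_min": [9, 8, 10], "response_ms": [128, 126, 131]}

print(release_regressions(before, after))  # ['errors_per_min']
```

A comparison of averages is deliberately crude. It is enough to prompt a closer look straight after a release, which is the point of watching vital signs in the first place.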

Medical teams recognise the signs of a life-threatening condition because they’ve had years of medical training and experience. They also benefit from the significant fact that on the whole, their patients all have the same vital signs.

Product teams have the disadvantage that production systems don’t all have the same pulse rate and temperature. As a team builds its product, it needs to learn not just what its vital signs are, but also how they correlate to normal versus life-threatening conditions.

What to measure?

A challenge with monitoring any system is knowing which metrics are really the system’s vital signs. An IT system is a complex stack, with a multitude of levels that can each provide a multitude of metrics:

“Big data”
Social media
User behaviour
Business metrics
Service performance
Service errors
CPU, disk and network usage

Monitoring everything would be prohibitively costly and result in an overwhelming flood of data, a lot of which isn’t meaningful for our ends. Conversely, having too narrow a focus might cause a warning sign to be missed.

One pattern that can be used to resolve this dilemma is to monitor just a few vital signs from three different levels of the stack. For example:

Level	Example vital signs
Business	Number of concurrent users; conversion or drop-out rate
Service	Exception or error rate (e.g. number of 5XX responses per minute); service response times
Resource	CPU, disk and memory usage
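To make one of these concrete, here is a minimal sketch of the service-level example, the number of 5XX responses per minute. It assumes the responses are already available as `(timestamp, status_code)` pairs. That input shape and the function name are my own invention, and in practice the data would come from your access logs or monitoring tool.

```python
from collections import Counter
from datetime import datetime


def server_errors_per_minute(responses):
    """Count 5XX responses per minute.

    `responses` is an iterable of (datetime, HTTP status code) pairs.
    Returns a Counter keyed by the timestamp truncated to the minute.
    """
    counts = Counter()
    for timestamp, status in responses:
        if 500 <= status <= 599:
            minute = timestamp.replace(second=0, microsecond=0)
            counts[minute] += 1
    return counts


responses = [
    (datetime(2015, 6, 1, 10, 0, 5), 200),
    (datetime(2015, 6, 1, 10, 0, 20), 500),
    (datetime(2015, 6, 1, 10, 0, 45), 503),
    (datetime(2015, 6, 1, 10, 1, 10), 404),
    (datetime(2015, 6, 1, 10, 1, 30), 502),
]

for minute, count in sorted(server_errors_per_minute(responses).items()):
    print(minute, count)
```

Client errors such as 404 are left out on purpose: they usually say more about the request than about the health of the service.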

To determine which vital signs to pick, you need to balance the cost of the measurement with its value:

Cost is the overall overhead of taking the measurement, and reporting and alerting on it.
Value has two axes:
- Correlation with current user experience. If errors are non-critical, then users may be oblivious to a high error rate; but if the service response time goes up, users will be frustrated by the long wait for feedback. High value metrics are those with a strong correlation to current user “happiness”.
- Time To Fix. When a developer sees a variation in the metric, how big is the cognitive gap between acknowledging the variation and knowing how to respond? Quick wins can be had via any metrics that minimise this gap. You should also ensure that you identify and address any critical measurement that has a high time to fix.
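One hypothetical way to make this balancing act explicit is to score each candidate metric. In the Python sketch below, the 0 to 1 scales, the scoring formula (value divided by cost) and the thresholds are all assumptions made for illustration. They are not an established method, and the numbers would be a team's own estimates.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    cost: float              # relative overhead of measuring, reporting and alerting (1 = cheap)
    user_correlation: float  # 0..1, how strongly it tracks current user "happiness"
    time_to_fix: float       # 0..1, 0 = the response is obvious, 1 = a large cognitive gap


def rank(candidates):
    """Order candidates by an illustrative value-for-cost score, best first."""
    def score(c):
        value = c.user_correlation + (1 - c.time_to_fix)
        return value / c.cost
    return sorted(candidates, key=score, reverse=True)


def needs_attention(candidates, correlation_floor=0.7, fix_ceiling=0.6):
    """Names of critical measurements that are slow to act on."""
    return [
        c.name for c in candidates
        if c.user_correlation >= correlation_floor and c.time_to_fix >= fix_ceiling
    ]


candidates = [
    Candidate("response_time", cost=1, user_correlation=0.9, time_to_fix=0.7),
    Candidate("error_rate", cost=1, user_correlation=0.5, time_to_fix=0.2),
    Candidate("disk_usage", cost=2, user_correlation=0.2, time_to_fix=0.1),
]

print([c.name for c in rank(candidates)])  # ['error_rate', 'response_time', 'disk_usage']
print(needs_attention(candidates))         # ['response_time']
```

The second function covers the final point above: a metric can rank below a quick win and still deserve work, because it matters to users but nobody yet knows how to respond when it moves.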

Remember, it’s all about Production

Successful Continuous Delivery is all about making small, incremental improvements to production, without making users sad. It can be a difficult practice to establish and improve on, though, as it requires significant culture change. Identifying a system’s vital signs is one way to help achieve this culture shift.

Lyndsay Prewer is a Delivery Lead at Equal Experts. A version of this post originally appeared on his own blog – Lyndsayp Ltd – where you can read more of his thoughts on iterative software delivery. You can also see him speaking on the topic at Agile on the Beach this September.
