Our Thinking Wed 11th May, 2022
Do more with less by picking compute that reduces maintenance
To deliver digital services quickly in 2022, pick compute products that maximise the time you can spend elsewhere.
Start with a function-first approach, and review whether other compute products are worth the increase in Total Cost of Ownership (TCO).
What is Total Cost of Ownership?
Total Cost of Ownership (TCO) is the purchase price of an asset combined with the cost of operation. That is, it covers not just the initial outlay, but also the cost to operate, maintain, and decommission the asset.
Think of it like this: when you buy a bicycle, the cost of owning it includes not just an initial price to buy the frame, wheels, and handlebars from the retailer, but also the cost of time and money to repair & maintain it. The money and time spent repairing and maintaining the bike combine with the initial price to form the TCO.
In software engineering, the Total Cost of Ownership (TCO) is the total cost to build and operate a digital service, including the initial development costs before user traffic, and the ongoing operational and development costs for the lifetime of the digital service. Staff salaries, tooling costs, implications of the process, governance, and technology choices all need to be factored into a TCO calculation.
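As a rough illustration of the definition above, the sketch below adds up the cost components named in this section. The function name, its parameters, and every figure are hypothetical, chosen only to show the arithmetic. A real calculation would need to break these totals down into staff salaries, tooling, process, governance, and technology choices.

```python
def total_cost_of_ownership(
    initial_build_cost: float,
    annual_operating_cost: float,
    annual_development_cost: float,
    lifetime_years: int,
    decommission_cost: float = 0.0,
) -> float:
    """Sum the initial outlay, lifetime running costs, and decommissioning cost."""
    ongoing = (annual_operating_cost + annual_development_cost) * lifetime_years
    return initial_build_cost + ongoing + decommission_cost


# Hypothetical figures for a service that runs for five years.
tco = total_cost_of_ownership(
    initial_build_cost=200_000,
    annual_operating_cost=60_000,
    annual_development_cost=90_000,
    lifetime_years=5,
    decommission_cost=10_000,
)
print(tco)  # 960000
```

With these made-up numbers the initial build is roughly a fifth of the total, which is why the ongoing costs deserve most of the attention.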
How does the choice of compute product impact a digital service’s TCO?
Because hosting is an ongoing cost during service operation, the choice of compute product type has a significant impact on a digital service’s TCO. The operational impact of architectural, design, and compute choices can be referred to as Business As Usual (BAU) work, operational work, or maintenance work.
For example, picking AWS EKS or GCP GKE as the compute product to host your service means the team will perform BAU work to manage and operate a Kubernetes cluster as part of their working day. With AWS Lambda or GCP Cloud Functions, by contrast, the vendor manages scaling and orchestration on your behalf.
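To give a feel for how little the team owns with a function product, here is a minimal sketch of a handler in the `(event, context)` style that AWS Lambda uses for Python functions. The `name` field on the event and the greeting are hypothetical, and the response shape assumes the function sits behind an HTTP integration that expects a status code and a body.

```python
import json


def handler(event, context):
    """Entry point the platform invokes once per request."""
    # "name" is a hypothetical event field used only for this illustration.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local invocation for illustration; in production the vendor calls handler().
response = handler({"name": "team"}, None)
print(response["statusCode"])
```

Nothing in this file deals with servers, clusters, or scaling rules; under this model that work sits with the vendor.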
By providing a compute product, the vendor effectively moves the maintenance work from your organisation to theirs, and cost reductions are possible as economies of scale allow them to perform maintenance at a lower cost.
TCO is a critical measure of the success of a digital service, and it’s closely linked to BAU work. Reducing BAU work will reduce TCO. The less you spend on BAU work, the more you can invest in the user experience and features, creating a higher return on investment.
Categorising compute products
A multitude of compute products exist that cover the differing levels of virtualisation anyone might want, including physical server hardware, infrastructure-as-a-service (IaaS), containers-as-a-service (CaaS), and functions-as-a-service (FaaS).
We can categorise compute products like EKS, GKE, AWS Lambda, physical servers, Heroku, and EC2 into the generic product types of Servers, Virtual Machines, Containers, and Functions.
Each product type has a different impact on the maintenance work the team will have to perform in order to use it, and each product type has different constraints that will enable the vendor to provide the product.
Several variant products exist at the boundaries of the above product types, typically improving the usability of the product type and offering reductions in TCO. With variant products, take care to establish where responsibility for maintenance and security patching lies, so that TCO isn’t accidentally increased.
Examples of variant products include functions with a container artefact format, which enable customisation of the function execution environment (for example, AWS Lambda Container Image), and containers that use buildpacks to achieve a user experience similar to a function (for example, Google App Engine or Heroku).
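The categorisation in this section can be sketched as a simple lookup. The product names come from this article; the `ProductType` enum, the `categorise` helper, and the placement of each variant product under a single type are our own simplification for illustration, since variant products sit on the boundary between types.

```python
from enum import Enum


class ProductType(Enum):
    SERVERS = "Servers"
    VIRTUAL_MACHINES = "Virtual Machines"
    CONTAINERS = "Containers"
    FUNCTIONS = "Functions"


# Illustrative mapping only; variant products are filed under the type
# this article describes them as extending.
PRODUCT_TYPES = {
    "physical servers": ProductType.SERVERS,
    "EC2": ProductType.VIRTUAL_MACHINES,
    "EKS": ProductType.CONTAINERS,
    "GKE": ProductType.CONTAINERS,
    "Heroku": ProductType.CONTAINERS,  # variant: buildpack-based containers
    "Google App Engine": ProductType.CONTAINERS,  # variant: buildpack-based
    "AWS Lambda": ProductType.FUNCTIONS,
    "GCP Cloud Functions": ProductType.FUNCTIONS,
    "AWS Lambda Container Image": ProductType.FUNCTIONS,  # variant
}


def categorise(product: str) -> ProductType:
    """Return the generic product type for a named compute product."""
    try:
        return PRODUCT_TYPES[product]
    except KeyError:
        raise ValueError(f"Unknown compute product: {product!r}") from None


print(categorise("EKS").value)  # Containers
```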
Visualising compute product maintenance work
The maintenance work of compute product types can be visualised as concentric circles, or a series of layers (much like an onion, matryoshka dolls, or ogres).
If you choose a server product, you’re not just responsible for providing power, cooling, storage, and networking; you also need to configure storage and networking, upgrade the OS, and patch framework vulnerabilities.
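One way to model the layers is as an ordered list, with each product type marking the point at which responsibility passes from the vendor to your team. The server row follows the description above. The layer names and the boundaries for the other three product types are assumptions made for this sketch, and real products (managed Kubernetes in particular) may leave more with your team than shown here.

```python
# Layers ordered from the outermost (closest to the hardware) inwards.
LAYERS = [
    "physical hardware (power, cooling, storage, networking)",
    "storage and network configuration",
    "host operating system upgrades",
    "execution environment (runtime and framework patching)",
    "application code",
]

# ASSUMPTION: index of the first layer your team still maintains.
# Only the "Servers" entry is taken directly from the article text.
FIRST_TEAM_LAYER = {
    "Servers": 0,
    "Virtual Machines": 1,
    "Containers": 3,
    "Functions": 4,
}


def team_responsibilities(product_type: str) -> list[str]:
    """Layers the team maintains itself for the given product type."""
    return LAYERS[FIRST_TEAM_LAYER[product_type]:]


for product_type in FIRST_TEAM_LAYER:
    owned = team_responsibilities(product_type)
    print(f"{product_type}: {len(owned)} layer(s) maintained by the team")
```

Under these assumptions the list shrinks as you move from servers towards functions, which is the reduction in maintenance work this article is concerned with.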
Any compute product type can be used to build a PaaS
Many organisations are building, or have built, an internal platform-as-a-service (PaaS) centred around digital workloads (digital platforms) or data workloads (data platforms); they then face a choice on the compute product they use to underpin that PaaS.
Any compute product type can be used as a basis for an internal PaaS. It’s important to remember that you’ll need much more than just the compute product to build a user-friendly platform. Common platform functionality and capabilities are detailed in the Equal Experts Digital Platform playbook.
What compute should you pick today?
If you’re starting to build a digital service today, choose a function-first approach and pick FaaS products to minimise your maintenance work, so you can spend that time elsewhere.
If you need to customise the execution environment and are happy to spend more time maintaining everything inside your container image, it makes sense to use a CaaS product to keep the rest of your maintenance time to a minimum.
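The function-first rule from the two paragraphs above can be written down as a tiny decision helper. The function name and its single flag are hypothetical; a real decision would weigh more requirements than this sketch captures.

```python
def pick_compute(needs_custom_execution_environment: bool = False) -> str:
    """Function-first: default to FaaS, move to CaaS only when required."""
    if needs_custom_execution_environment:
        # You accept maintaining everything inside your container image.
        return "CaaS"
    return "FaaS"


print(pick_compute())  # FaaS
print(pick_compute(needs_custom_execution_environment=True))  # CaaS
```

Making FaaS the default means any move to another compute product has to be justified by a concrete requirement.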
You may find, as you design and build services using containers or functions, that you’ll spend more time thinking about the boundaries between services to keep them highly cohesive and loosely coupled. However, you will still enjoy a significant reduction in TCO compared to time spent on maintenance of other compute product types.
Using a function-first approach means you’re able to review other compute products when there are requirements for greater customisation. It also helps you make a quick, informed choice as to whether it’s worth increasing the time spent on maintenance to satisfy that requirement.
The total cost of ownership is significantly influenced by the choice of compute product because of the related maintenance work it creates. By choosing compute products like containers or functions you can significantly reduce the maintenance work performed by your teams.
Removing maintenance work frees up time for higher-value work, such as proactively investigating failures in your services or introducing secure delivery ways of working.