Why Kubernetes, Kafka, or Istio can derail your platform engineering efforts

Platform engineering means creating user-centric capabilities that enable teams to achieve their business outcomes faster than ever before. At Equal Experts, we’ve been doing platform engineering for a decade, and we know it can be an effective solution to many scaling problems. 

Unfortunately, it’s easy to get platform engineering wrong. There are plenty of pitfalls, which can contaminate your engineering culture and prevent you from sustainably scaling your teams up and down. In this series, I’ll cover some of those pitfalls, starting with the power tools problem.

How to measure a platform capability

A platform capability mixes people, processes, and tools (SaaS, COTS, and/or custom code) to provide one or more enabling functions to your teams. To stay user-centered and focused on your mission, you need to measure a capability in terms of: 

  • Internal customer value. How much it improves speed, reliability, and quality for your teams. The higher this is, the faster your teams will deliver.
  • Internal customer costs. How much unplanned tech work it creates for your teams. The lower this is, the more capacity your teams will have.
  • Platform costs. How much build and run work it creates for your platform team. The lower this is, the fewer platform engineers you’ll need.
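
The three measures above can be sketched as a simple scoring function. This is a hypothetical illustration, not an Equal Experts standard: the 1-5 scales, weights, and capability names are all invented.

```python
# A minimal sketch of scoring a platform capability on the three measures.
# The 1-5 ratings and equal weighting are illustrative assumptions; real
# inputs would come from team surveys and platform team estimates.

def capability_score(value: int, customer_costs: int, platform_costs: int) -> int:
    """Higher is better: reward internal customer value, penalise both costs."""
    return value - customer_costs - platform_costs

# Example: a Kubernetes-based capability with high value but high costs...
kubernetes = capability_score(value=5, customer_costs=4, platform_costs=5)
# ...versus a managed alternative with similar value and much lower costs.
managed = capability_score(value=4, customer_costs=2, platform_costs=2)
```

Even a rough scorecard like this makes the trade-off visible before the costs silently build up.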

Whether it’s data engineering or a microservices architecture, it’s all too easy for your well-intentioned platform team to make the wrong trade-offs, and succumb to a pitfall. Here’s one of those tough situations. 

The hidden costs of power tools

Implementing core platform capabilities with power tools like Kubernetes, Kafka, and/or Istio is one of the biggest pitfalls we regularly see in enterprise organizations. Power tools are exciting and offer a lot of useful features, but unless your service needs are complex and your platform team knocks it out of the park, those tools will require a lot more effort and engineers than you’d expect. 

Here’s a v1 internal developer platform, which uses Kubernetes for container orchestration, Kafka for messaging, and Istio for service mesh. A high level of internal customer value is possible, but there are also high internal customer costs and a high platform cost. It’s time-consuming to build and maintain services on this platform.

[Figure: v1 of an internal developer platform, shown as a heavy weight containing Kubernetes, Istio, and Kafka capabilities. A horizontal bar chart shows the high internal customer value, high internal customer costs, and high platform costs of heavyweight power tools.]

This pitfall happens when your platform team prioritizes the tools they want over the capabilities your teams need. Teams will lack capacity for planned product work, because they have to regularly maintain Kubernetes, Kafka, and/or Istio configurations beyond their core competencies. And your platform team will require numerous engineers with specialized knowledge to build and manage those tools. Those costs aren’t usually measured, and they slowly build up until it’s too late.

For example, we worked with a Dutch broadcaster whose teams argued over tools for months. The platform team wanted Kubernetes, but the other teams were mindful of deadlines and wanted something simpler. Kubernetes was eventually implemented, without a clear business justification. 

Similarly, a German retailer used Istio as their service mesh. The platform team was nervous about upgrades, and they waited each time for a French company to go first. There was no business relationship, but the German retailer had a documented dependency on the French company’s technology blog.

Transitioning from heavyweight to lightweight tools

You escape the power tools pitfall by replacing your heavyweight capabilities with lightweight alternatives. Simpler tools can deliver similar levels of internal customer value, with much lower costs. For example, transitioning from Kubernetes to ECS can reduce internal customer costs as teams need to know less and do less, and also lower your platform costs as fewer platform engineers are required. 

Here’s a simple recipe to replace a power tool with something simpler and lower cost. For each high-cost capability, use the standard lift and shift pattern:

  • Declare it as v1, and restrict it to old services
  • Rebuild v1 with lightweight tools, and declare that as v2
  • Host new services on v2
  • Lift and shift old services to v2
  • Delete v1
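
The recipe above can be sketched as code. The platform versions, service names, and registry shape here are invented for illustration; a real migration would use your own service catalogue and routing.

```python
# A hypothetical sketch of the lift and shift recipe. "orders", "payments",
# and "invoicing" are invented service names.

platform = {
    "v1": {"status": "restricted", "services": ["orders", "payments"]},  # old services only
    "v2": {"status": "active", "services": []},  # rebuilt with lightweight tools
}

def host_new_service(name: str) -> None:
    # New services always land on v2, never on the restricted v1.
    platform["v2"]["services"].append(name)

def lift_and_shift(name: str) -> None:
    # Move an old service from v1 to v2 once its migration is complete.
    platform["v1"]["services"].remove(name)
    platform["v2"]["services"].append(name)

host_new_service("invoicing")
for service in list(platform["v1"]["services"]):
    lift_and_shift(service)

# Once v1 is empty, it can be deleted.
if not platform["v1"]["services"]:
    del platform["v1"]
```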

As with any migration, resist the temptation to put new services onto v1, and design v2 interfaces so migration costs are minimized. Here’s v2 of the imaginary developer platform, with Fargate, Kinesis, and App Mesh replacing Kubernetes, Kafka, and Istio. Capability value remains high, and costs are much lower.

[Figure: the v1 heavyweight platform capabilities transitioned to lightweight v2 capabilities, shown as bubbles for App Mesh, Kinesis, and Fargate. A horizontal bar chart compares the high internal customer and platform costs of the heavyweight capabilities with the lower costs of the lightweight system.]

Conclusion

Power tools are a regular pitfall in platform engineering. Unless your platform team can build and run them to a high standard, they’ll lead to a spiral of increasing costs and operational headaches. Transitioning to lighter, more manageable solutions means you can achieve a high level of internal customer value as well as low costs. 

A good thought experiment here is “how many engineers want to build and run Kubernetes, Kafka, or Istio a second time?”. My personal experience is not many, and that’s taking managed services like EKS and Confluent into account.

I’ll share more platform engineering insights in my talk “Three ways you’re screwing up platform engineering and how to fix it” at the Enterprise Technology Leadership Summit Las Vegas on 20 August 2024. If you’re attending, I’d love to connect and hear about your platform engineering challenges and solutions.

To deliver digital services quickly in 2022, pick compute products that maximise the time you can spend elsewhere.

Start with a function-first approach, and review whether other compute products are worth the increase in Total Cost of Ownership (TCO).

What is Total Cost of Ownership?

Total Cost of Ownership (TCO) is the purchase price of an asset combined with the cost of operation. That is, not just the initial cost outlay, but additionally the cost to operate, maintain, and decommission the asset.

Think of it like this: when you buy a bicycle, the cost of owning it includes not just the initial price of the frame, wheels, and handlebars from the retailer, but also the time and money spent repairing and maintaining it. Those ongoing costs combine with the initial price to form the TCO.

In software engineering, the Total Cost of Ownership is the total cost to build and operate a digital service, including the initial development costs before user traffic, and the ongoing operational and development costs for the lifetime of the digital service. Staff salaries, tooling costs, and the implications of process, governance, and technology choices all need to be factored into a TCO calculation. 
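
A back-of-the-envelope TCO calculation makes the definition concrete. All figures below are invented for illustration; real numbers come from your own salaries, tooling invoices, and cloud bills.

```python
# A hedged worked example of a TCO calculation over a service's lifetime.
# Every figure here is a hypothetical illustration.

initial_build = 250_000      # development cost before user traffic
annual_operations = 60_000   # BAU/maintenance work per year
annual_tooling = 15_000      # licences, cloud spend, monitoring
decommission = 20_000        # end-of-life cost
lifetime_years = 5

tco = (initial_build
       + lifetime_years * (annual_operations + annual_tooling)
       + decommission)

print(f"{lifetime_years}-year TCO: £{tco:,}")  # £645,000
```

Note how the ongoing costs (£375,000 over five years here) dwarf the one-off decommissioning cost and rival the initial build, which is why the choice of compute product matters so much.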

How does the choice of compute product impact a digital service’s TCO?

As an ongoing cost during service operation, the choice of compute product type to host the service has a significant impact on a digital service’s TCO. The operational impact of the architectural, design, and compute choices can be referred to as Business As Usual (BAU) work, operational work, or maintenance work.

For example, picking AWS EKS or GCP GKE as the compute product to host your service will mean the team will perform BAU work to manage and operate a Kubernetes cluster as part of their working day, compared to picking AWS Lambda or GCP Cloud Functions, where the vendor works to manage scaling and orchestration on your behalf. 
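
To illustrate the contrast: a complete AWS Lambda function in Python is just a handler, with scaling, patching, and orchestration handled by the vendor. (The event shape below is a hypothetical example; your real events depend on the trigger.)

```python
# A minimal AWS Lambda handler in Python. There is no cluster to manage:
# the vendor invokes this function and scales it on your behalf.

import json

def handler(event, context):
    # "name" is a hypothetical field for this example's event payload.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Locally you can invoke it directly; in AWS, the platform does this for you.
response = handler({"name": "platform team"}, None)
```

Compare this with the standing BAU work of operating a Kubernetes cluster: node upgrades, control plane versions, networking, and capacity planning all fall on your team.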

By providing a compute product, the vendor effectively moves the maintenance work from your organisation to theirs, and cost reductions are possible as economies of scale allow them to perform maintenance at a lower cost.

TCO is a critical measure of the success of a digital service, and it’s closely linked to BAU work. Reducing BAU work will reduce TCO. The lower the costs spent on BAU work, the more you can invest in the user experience and features, creating a higher return on investment. 

Categorising compute products

A multitude of compute products exists, covering differing levels of virtualisation: physical server hardware, infrastructure-as-a-service (IaaS), containers-as-a-service (CaaS), and functions-as-a-service (FaaS). 

We can categorise compute products like EKS, GKE, AWS Lambda, physical servers, Heroku, and EC2 into the generic product types of Servers, Virtual Machines, Containers, and Functions. 

Each product type has a different impact on the maintenance work the team will have to perform in order to use it, and each product type has different constraints that will enable the vendor to provide the product.

Several variant products exist at the boundaries of the above product types, typically improving the usability of the product type and offering reductions in TCO. Care must be taken to establish where responsibility for maintenance and security patching lies with variant products, to ensure TCO isn’t accidentally increased.

Examples of variant products include functions with a container artefact format, which enable customisation of the function execution environment (for example, AWS Lambda Container Image), and containers that use buildpacks to achieve a function-like user experience (such as Google App Engine or Heroku).

Visualising compute product maintenance work

The maintenance work of compute product types can be visualised as concentric circles, or a series of layers (much like an onion, matryoshka dolls, or ogres). 

If you choose a server product, you’re not just responsible for providing power, cooling, storage, and networking; you also need to configure storage and networking, upgrade the OS, and patch framework vulnerabilities.
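
The layering can be sketched as data. The layer breakdown and counts below are an illustrative simplification of the usual shared-responsibility picture, not a vendor-published model.

```python
# A sketch of the maintenance "onion": each compute product type peels away
# more of the layers below, leaving less for your team to maintain.

layers = [
    "power, cooling, storage and networking hardware",  # yours with Servers only
    "virtualisation and OS installation",               # handed off with IaaS
    "OS upgrades and security patching",                # handed off with CaaS
    "runtime, scaling and orchestration",               # handed off with FaaS
    "application code and configuration",               # always yours
]

# Roughly how many layers your team still maintains, by product type
# (illustrative counts, not a formal responsibility matrix).
your_layers = {"Servers": 5, "Virtual Machines": 4, "Containers": 2, "Functions": 1}
```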

Any compute product type can be used to build a PaaS

Many organisations are building, or have built, an internal PaaS centred on digital workloads (digital platforms) or data workloads (data platforms); they then face a choice of compute product to underpin that PaaS.

Any compute product type can be used as a basis for an internal PaaS. It’s important to remember that you’ll need much more than just the compute product to build a user-friendly platform. Common platform functionality and capabilities are detailed in the Equal Experts’ Digital Platform playbook.

What compute should you pick today?

If you’re starting to build a digital service today, choose a function-first approach and pick FaaS products to minimise your maintenance work, so you can spend that time elsewhere. 

If you need to customise the execution environment and are happy to spend more time maintaining everything inside your container image, a CaaS product is the next best choice for minimising maintenance time.

You may find, as you design and build services using containers or functions, that you’ll spend more time thinking about the boundaries between services to keep them highly cohesive and loosely coupled. However, you will still enjoy a significant reduction in TCO compared to time spent on maintenance of other compute product types.

Using a function-first approach means you can review other compute products when requirements call for greater customisation, and quickly make an informed choice as to whether satisfying that requirement is worth the increase in maintenance time.

Final Thoughts

The choice of compute product significantly influences total cost of ownership because of the maintenance work it creates. By choosing compute products like containers or functions, you can significantly reduce the maintenance work performed by your teams.

By removing maintenance work you can use that time for higher value work such as proactively investigating failures in your services or introducing secure delivery ways of working.