Phil Parker

Global Head of Technology Strategy | AI in Delivery

October 2, 2025

Why is AI making us talk about “developer productivity” (again)?

When The Pragmatic Engineer recently asked how tech companies are measuring the impact of AI, the answer turned out to be… complicated. Some count daily active users of AI tools. Others look at the percentage of code generated by AI. Still others are tracking pull request cycle times, developer satisfaction scores, or the percentage of features shipped that relied on AI at some point.

It makes for an impressive list. But it also raises a question that resurfaces every time a new technology wave hits software delivery – Agile, DevOps, Cloud, and now AI: why do people always jump back to asking about developer productivity? Finance teams are not asked to optimise the speed of reconciling a single transaction. Accountants aren’t measured by the number of spreadsheets created per hour. Yet developer efficacy is repeatedly reduced to individual tasks (aka typing speed!).

I think this is because people measure what’s easy, rather than what’s meaningful. They chase productivity as if it were a single number, ignoring the fact that software delivery is a complex, end-to-end system.

The risk is that AI metrics become an echo chamber: lots of reverberation, but little clarity.

So how do we avoid this? We think it helps to group AI success metrics into three broad categories:

  • AI usage – adoption and activity metrics that track how much AI is being used across the organisation (in some cases this is exposed as a measure of cost).
  • Task productivity – local productivity measures that focus on specific tasks and time savings.
  • Delivery success – system-level outcomes that show whether teams are delivering value faster, with higher quality, and less risk.

Each of these categories has its place. But each also comes with traps that leaders need to be wary of if they want AI to actually improve delivery, rather than just fill up another dashboard.

1. AI usage: The Goodhart’s Law problem

The simplest way to measure AI is to count how often it’s being used: the number of active users, the number of prompts, the number of pull requests that include AI-generated code. These metrics appear in almost every company survey because they’re easy to collect and easy to visualise.

And in aggregate, they can be useful. They tell us something about adoption trends, which tools are gaining traction, which workflows are sticking, where licenses are sitting idle. They can help with communications – both internally (“look how fast this is spreading!”) and externally (“we’re an AI-forward organisation!”).

The problem comes when usage metrics are mistaken for outcomes.

This is where Goodhart’s Law kicks in: when a measure becomes a target, it ceases to be a good measure.

If you tell teams they will be rewarded based on the number of AI prompts they make, don’t be surprised if prompts skyrocket without any impact on delivery. If you chase the percentage of code written by AI, don’t be shocked when developers generate boilerplate just to hit their targets.

The paradox is that the best uses of AI (in fact, of most technologies) often show up as less activity. A great AI-generated test suite might reduce the number of bugs found later. A developer who uses AI to reason through a design problem might generate fewer lines of code overall. Counting “activity” misses the point entirely.

Alongside usage, it is also worth tracking cost. AI tools are not free, and usage at scale can quickly spiral into significant spend.

Understanding cost matters for governance and investment purposes – ensuring you know where money is going, and that spend doesn’t run out of control. But again, cost is not a measure of performance, efficacy, or success. A high bill doesn’t prove value any more than a low one does; it simply tells you the scale of the investment.

So, by all means, track AI usage and cost — but treat them as inputs, not outcomes. Aggregate them, anonymise them, look for patterns. Use them to understand adoption and investment, not to judge delivery performance. Because usage and cost tell you something about behaviour, but very little about value.
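To make “inputs, not outcomes” concrete, here is a minimal sketch of the kind of aggregation this implies – anonymised usage and cost records rolled up into adoption trends. The record structure, field names, and tool names are hypothetical; the point is that the output answers “where is adoption and spend heading?” and attributes nothing to an individual.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical, anonymised usage records – in practice exported from an
# AI tool's admin console or billing API, whatever yours provides.
@dataclass
class UsageRecord:
    team: str
    tool: str
    month: str        # e.g. "2025-09"
    active_users: int
    spend_gbp: float


def adoption_trends(records: list[UsageRecord]) -> dict:
    """Roll usage and cost up per (tool, month): an adoption and
    investment signal, not a judgement on anyone's performance."""
    summary = defaultdict(lambda: {"active_users": 0, "spend_gbp": 0.0})
    for r in records:
        key = (r.tool, r.month)
        summary[key]["active_users"] += r.active_users
        summary[key]["spend_gbp"] += r.spend_gbp
    return dict(summary)


records = [
    UsageRecord("team-a", "assistant-x", "2025-09", 12, 180.0),
    UsageRecord("team-b", "assistant-x", "2025-09", 7, 105.0),
]
for (tool, month), stats in adoption_trends(records).items():
    print(tool, month, stats)   # trends in adoption and spend, nothing more
```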

MIT reported this year that 95% of generative AI pilots at companies are failing – reinforcing that how much AI is being used is much less important than whether it’s delivering value.

 

2. Developer task productivity: The Theory of Constraints problem

The next temptation is to look at task-level productivity. How many hours does AI save a developer per week? What percentage of code suggestions are accepted? How much faster are pull requests being drafted?

These are seductive metrics because they seem to show “hard” productivity gains. And at the local level, they often do. Developers will tell you that AI makes them faster at writing boilerplate, generating tests, or exploring new APIs. Studies back this up: local productivity improvements are real.

But here’s the catch: local improvements rarely move the needle at the system level.

This is where the Theory of Constraints comes in. In any complex system, throughput is determined by the bottleneck. Improving efficiency in non-bottleneck steps is largely irrelevant. Making developers type faster won’t matter if the real constraint is stakeholder approvals, test environments, or security reviews. And in most organisations, the actual time developers spend coding is a small fraction of the total delivery timeline.

There’s also a problem of extrapolation. Just because a single task takes half the time doesn’t mean overall delivery is twice as fast. Often, the speed-up in one step is offset by extra time elsewhere: more preparation before the task, more review afterwards, or more experimentation upfront just to achieve that single improvement. Isolated metrics create a misleading picture when they’re extrapolated to the system as a whole.
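A back-of-the-envelope calculation makes the trap visible. The numbers below are illustrative assumptions, not measurements: coding is taken as a fifth of a 20-day lead time, and AI is generously assumed to halve it.

```python
# Illustrative assumptions, not measurements: a 20-day end-to-end lead
# time, of which actual coding is 4 days; the remaining 16 days are
# reviews, approvals, environments, and other waiting.
coding_days = 4.0
other_days = 16.0
lead_time = coding_days + other_days              # 20 days end to end

# Generously assume AI halves coding time.
new_lead_time = coding_days / 2 + other_days      # 18 days

task_speedup = coding_days / (coding_days / 2)    # 2.0x on the task
system_speedup = lead_time / new_lead_time        # ~1.11x overall
print(f"Task speed-up: {task_speedup:.1f}x, "
      f"delivery speed-up: {system_speedup:.2f}x")
# A 2x local gain moves end-to-end delivery by about 11%: the
# bottleneck, not the task, sets the ceiling.
```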

Worse, focusing too much on task productivity metrics can negatively impact behaviour. We reduce developers to code machines, ignoring design, collaboration, and problem-solving – the very things that make software valuable. We end up optimising for outputs, not outcomes.

That doesn’t mean task-level metrics are useless. They can be a powerful internal tool for teams to challenge their own assumptions. If a team thinks AI is saving them hours on code reviews but the data shows otherwise, that’s useful feedback. If a team wants to experiment with AI-assisted testing, local productivity measures can help them evaluate the impact.

But these metrics should remain team-level, reflective, and exploratory – not executive KPIs. They should drive continuous improvement, not performance management. Because task productivity tells you something about potential, but not about delivery.

See Marco Vermeulen’s “Madness to Method” series for some tangible examples of how EE developers are improving their effectiveness with AI.

 

3. Delivery success: The real test

The third category – and the only one that really matters – is delivery success. Does AI actually help teams deliver value faster, at higher quality, and with less risk?

Here, the system-level metrics popularised by Accelerate – Deployment Frequency, Lead Time for Changes, Change Failure Rate, Time to Restore Service, and Unplanned Tech Work Rate – remain the gold standard. They don’t just measure activity; they measure the flow of value. They reflect whether teams are improving the end-to-end system, not just a single step.
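If you want to see what tracking these looks like mechanically, here is a minimal sketch computing three of them from delivery events. The records and field names are hypothetical stand-ins for whatever your CI/CD pipeline and incident tooling actually expose.

```python
from datetime import datetime, timedelta

# Hypothetical delivery events – stand-ins for whatever your CI/CD
# pipeline and incident tooling actually record.
deployments = [
    {"committed": datetime(2025, 9, 1, 9), "deployed": datetime(2025, 9, 2, 15), "failed": False},
    {"committed": datetime(2025, 9, 3, 10), "deployed": datetime(2025, 9, 3, 17), "failed": True},
    {"committed": datetime(2025, 9, 5, 11), "deployed": datetime(2025, 9, 8, 12), "failed": False},
]

def deployment_frequency(deploys: list, period_days: int = 30) -> float:
    """Deployments per day over the observation period."""
    return len(deploys) / period_days

def median_lead_time(deploys: list) -> timedelta:
    """Median time from commit to running in production."""
    durations = sorted(d["deployed"] - d["committed"] for d in deploys)
    return durations[len(durations) // 2]

def change_failure_rate(deploys: list) -> float:
    """Share of deployments that led to a failure in production."""
    return sum(d["failed"] for d in deploys) / len(deploys)

print(f"Deploys/day:         {deployment_frequency(deployments):.2f}")
print(f"Median lead time:    {median_lead_time(deployments)}")
print(f"Change failure rate: {change_failure_rate(deployments):.0%}")
```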

The key point is that delivery success can’t be “proven” by academic-style experiments that isolate causality. Teams rarely improve by running A/B tests on themselves. Instead, they improve through experience, reasoning, and iterative feedback. They experiment with practices, they learn what works in their context, and they adapt.

AI should be understood the same way. Its impact will be seen in shorter feedback loops, in smoother deployments, in fewer defects, in happier developers. Not because AI itself was isolated as the causal factor, but because teams used it to improve the system.

This is where leadership matters. If you want to know whether AI is working, don’t ask “how many lines of code did it generate?” Ask: are our teams deploying more frequently? Are changes reaching customers faster? Are we reducing rework and recovery times? Those are the metrics that connect directly to business outcomes.

Our work with Travelopia found that AI tools showed real promise in accelerating delivery: “The AI-powered team of three engineers replaced the entire lead scoring system for three regions in just three months, whereas four engineers following traditional agile practices were only able to cover a fraction of the functionality in four months.” Read the case study.

Conclusion: Beyond the dashboard

The Pragmatic Engineer’s survey of AI metrics was interesting because it revealed not just what companies are measuring, but how uncertain they are about what matters.

Uncertainty of this type can be dangerous, because the wrong metrics don’t just waste time – they actively distort behaviour.

Here’s a rule of thumb:

  • Usage metrics show adoption, but not value.
  • Task productivity metrics show local improvements, but not system throughput.
  • Delivery success metrics show flow of value, and are the only ones that matter at the enterprise level.

That doesn’t mean you should ignore the first two. They can be useful signals, if you keep them in their place.

But if your AI dashboards are not helping you deliver the only productivity metric that matters – improved time and cost to value – then they’re just more noise.

About the author

Phil Parker is Global Head of Technology Strategy at Equal Experts, where he helps organisations navigate the rapidly evolving technology landscape and deliver meaningful business outcomes. With more than two decades of experience spanning software product delivery, agile transformation, and strategic leadership, Phil specialises in shaping technology approaches that align with organisational goals and deliver lasting value.

He is passionate about applying emerging technologies – currently AI in Delivery in particular – in practical, outcome-focused ways, and about building collaborative, empowered teams that solve complex problems. Phil’s work is driven by a belief that great technology strategy is as much about people and culture as it is about tools and platforms.
