Valuemaxxing
Demonstrating value with AI involves more than counting tokens
About a month ago, the Financial Times reported that Amazon had set a target for 80 per cent of its developers to use AI every week, and built internal leaderboards that ranked them by how many tokens they used. So they burned tokens.
Some of them wired up agents whose main purpose was to consume tokens on their behalf. The practice got a name: tokenmaxxing.
Last year, Meta’s Chief People Officer Janelle Gale told employees that “AI-driven impact” would be a “core expectation” in 2026. Employees created a leaderboard similar to Amazon’s, which was taken down only after its existence became public.
These same engineers are concerned about the environmental impact of AI as well as its efficacy. But tokenmaxxing is the rational response to what they are being asked to do.
If you tell a group of clever, busy people that the number on the board is what matters, and you hint that the number might find its way into a performance review, it doesn’t matter what you intended. You’re going to fall foul of Goodhart’s law: when a measure becomes a target, it ceases to be a good measure.

Leadership may want to ensure their engineers upskill in AI. They may have a deeply held conviction that competitive advantage depends on becoming ‘AI-native’ as quickly as possible. Gamifying usage in this way pretty much guarantees that AI usage will increase, but it will not ensure that anything gets better. A developer who solves a problem with one careful prompt quickly learns that their score is lower than that of a colleague who has set up an agent to thrash through forty.
The use of leaderboards may seem extreme, but it’s the logical outcome of how companies have been measuring their AI investment.
AI adoption has become close to universal. The 2025 DORA report puts AI use among developers at around 90 per cent. DX’s Q1 2026 AI impact report has it at 93 per cent, with engineers reporting that nearly thirty per cent of merged code is now AI-generated. Using these tools is no longer a differentiator. It has become the standard. The usage war has been won, so how do we measure impact? This is where organisations are struggling.
LeadDev’s AI Impact Report 2025 found that only 18 per cent of organisations are measuring the impact of AI coding tools at all. Sixty per cent of respondents said they lacked clear metrics to evaluate the impact of AI.
So the real issue isn’t that some companies are incentivising AI usage, or that engineers are responding by gamifying a number. It’s the fact that most companies don’t have a number at all.
The companies that do have potentially useful metrics are looking at development time per feature, weekly time saved per engineer, and time spent reviewing AI-suggested or AI-generated code.
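As a rough illustration, the last two of those measures can be netted against each other. The sketch below is a minimal example, not any company's actual method: the WeeklySample record, its field names, and the idea of averaging the difference across engineers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WeeklySample:
    """One engineer's figures for one week (hypothetical schema)."""
    hours_saved: float         # time the engineer estimates AI saved them
    hours_reviewing_ai: float  # time spent reviewing AI-suggested or AI-generated code


def net_hours_saved(samples):
    """Mean of (hours saved - hours spent reviewing AI code) across engineers."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(s.hours_saved - s.hours_reviewing_ai for s in samples) / len(samples)
```

A negative result would mean that, on average, review time is eating more than the reported saving.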
Beyond tokenmaxxing, the numbers that make headlines tend to be about the amount of code written by machine. Microsoft claims that Copilot now writes 40 per cent of its code, which seems impressive, but if the time spent reviewing this code outweighs the time saved, what does it mean for the engineering experience? If AI-generated code quickly comes to be seen as technical debt and needs to be rewritten next quarter, what does that say about the value created? How do we measure the long-term impact?
We’re not there yet.
DX has published a framework for AI metrics split across utilisation, impact, and cost.

So far, the impact metrics are still process-oriented. This makes sense because AI is an amplifier. AI does not fix a team; it amplifies what is already there. Strong teams get stronger. Struggling teams find their struggles arriving faster and in greater volume.
The DX Q1 report demonstrates this. They find that quality is volatile. Some teams improve as AI use rises. Others see defects climb by as much as half. A team can now produce more code, more confidently, at higher speed. If a team lacks direction, they are getting faster at building the wrong thing; AI is the most powerful accelerant yet bolted to that particular engine.
Laura Tacho, DX’s former CTO, puts it plainly: early metrics, such as acceptance rate, were meant to show whether a tool was fit for purpose, not to measure its impact across an organisation.
In many ways, as we move into a new paradigm, it makes sense to revert to more basic metrics, while looking to attribute step changes in performance to AI. Did the teams deliver more frequently? Did the product reach the user faster, break less, and take less time to restore when it did fail? How do we attribute those changes to AI rather than to other improvements? Optimising for token consumption doesn’t help with any of that. As we learn more, we can work out what an AI SPACE programme could look like.
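To make those basic questions concrete, here is a minimal sketch of how the four delivery measures could be computed from a list of deployments. The Deployment record and its fields are hypothetical, and the definitions used (for example, lead time measured from commit to deploy) are assumptions rather than an official standard. Running it over periods before and after an AI rollout would show whether the numbers moved, but not why.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical schema)."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # did this change cause a failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def _mean_hours(deltas):
    """Average a non-empty list of timedeltas, expressed in hours."""
    return sum(deltas, timedelta()).total_seconds() / 3600 / len(deltas)


def delivery_metrics(deployments, period_days):
    """Frequency, lead time, failure rate and restore time for one period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at
                for d in failures if d.restored_at is not None]
    return {
        "deploys_per_week": len(deployments) / period_days * 7,
        "mean_lead_time_hours": _mean_hours(
            [d.deployed_at - d.committed_at for d in deployments]),
        "change_failure_rate": len(failures) / len(deployments),
        # None when nothing failed or no restore time was recorded
        "mean_time_to_restore_hours": _mean_hours(restores) if restores else None,
    }
```

The attribution question stays open: a comparison like this cannot by itself separate the effect of AI from other changes made over the same period.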
Amazon’s leaderboard is gone. A senior leader reportedly told staff, in the plainest possible terms, “please don’t use AI just for the sake of using AI”, and the company moved towards a measure of useful code shipped rather than tokens spent.
Valuemaxxing should be the goal. Token usage is like GDP. It tells you everything and nothing. Valuemaxxing will take time to emerge. We’re in the rollout phase of this technological shift. A lot of the metrics we’ll eventually use are lagging, and some don’t exist yet. In the meantime, we can steer clear of targets and focus on the best proxies we have: did throughput and system stability increase? What is happening to the developer experience? We may not have the right numbers yet, but we can do better than count tokens.





