Strategy · June 2026

How to Think About AI as We Measure Outcomes Versus Token Spend

Wato Labs

We are in a strange phase of AI adoption. For the last few years, many companies have operated on the assumption that more AI spend leads to more productivity, and that more productivity eventually becomes more growth.

Early on, that assumption was often directionally right. Engineers shipped faster, support teams answered more questions, analysts moved through research more quickly, and operators automated work that used to take hours.

But that relationship does not scale forever. There is a Pareto frontier for AI spend. The first wave of investment can create obvious leverage because the work being improved is easy to find: draft the email, summarize the meeting, generate the code, search the docs, route the ticket.

After a point, though, the curve starts to flatten. Another dollar of AI spend does not automatically create another dollar of business value. The issue is not that AI stops working. It is that AI changes where the bottleneck lives.

Figure: Pareto frontier chart showing the return on AI spend flattening while bottlenecks move to implementation, trust, distribution, and adoption.

The Bottleneck Moves

For a long time, one of the default tensions inside software companies was that product and sales could promise more than engineering could build. Engineering was the constraint. Roadmaps were full, backlogs were long, and the cost of turning an idea into production software was high enough that companies had to choose carefully.

AI changes that balance. If engineering throughput goes up, production code is no longer always the tightest constraint. Teams can prototype faster, generate more code, create more internal tools, and automate more pieces of workflow. That is real leverage, but it also creates a new problem: once output becomes cheaper, the organization has to absorb it.

Someone still has to sell it, package it, deploy it, support it, explain it to customers, and turn it into revenue or retention.

That is the inversion worth paying attention to. The old failure mode was selling something engineering could not build. The new failure mode may be building more than the company can sell, deploy, or operationalize. In that world, more AI spend can produce more motion without producing more progress.

Token Spend Is Not Progress

Token spend is a weak way to understand AI adoption because it measures activity, not value. It tells you that a model was used, but not whether the output was useful, accepted, shipped, sold, or reused.

A sales workflow is not valuable because an agent sent more emails; it is valuable if it creates qualified pipeline. An engineering workflow is not valuable because more code was produced; it is valuable if the code maps to something the business needs, can review, can ship, can maintain, and can sell.

That distinction matters because AI makes it very easy to confuse motion with progress. Companies can generate more messages, more summaries, more code, more workflow runs, and a larger bill while still not knowing which work was worth it.

The question is not simply "How much AI did we use?" The question is "Which work improved because AI was involved?"

The Hard Part Is Knowing the Marginal Return

The next phase of AI adoption will not be about whether companies spend more on AI. They will. The better question is where the next dollar of AI spend still matters.

Some work deserves frontier models: complex reasoning, ambiguous customer analysis, production debugging, strategic research, and high-stakes decisions. Other work does not. A meeting summary, calendar update, basic CRM lookup, routine extraction job, or simple internal status update should not always be routed to the most expensive intelligence available.

In theory, that sounds obvious. In practice, it is very hard for most companies to act on. For the long tail of teams, AI rollout does not look like a clean workflow architecture. It looks like buying Claude seats, giving engineers access to Codex, letting teams experiment with ChatGPT, and hoping the useful patterns emerge.

That can be a good way to start, but it makes optimization almost impossible. If every person uses AI differently, inside different clients, with different context and different habits, the company has no clean way to know which workflows need which models, which work is repeatable, or where token cost can be reduced without hurting quality.

This is why standardization and governance have to come before serious cost optimization. By standardization, I mean turning repeated AI usage into named workflows: support triage, renewal review, code review, sales research, onboarding, reporting, incident follow-up. By governance, I mean defining who can run those workflows, which tools they can touch, what data they can access, when approval is required, and what gets logged.
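To make the governance half of that concrete, here is a minimal sketch of what a named workflow's policy record could look like. Everything in it is an assumption for illustration: the `WorkflowPolicy` fields, the `check_call` helper, and the role and tool names are hypothetical, not a description of any particular product's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowPolicy:
    """Hypothetical governance record for one named workflow."""
    name: str
    allowed_roles: frozenset[str]          # who can run the workflow
    allowed_tools: frozenset[str]          # which tools it can touch
    data_scopes: frozenset[str]            # what data it can access
    approval_required_for: frozenset[str] = frozenset()  # tools gated on review
    logged_fields: tuple[str, ...] = ("user", "tool", "model", "cost_usd")


def check_call(policy: WorkflowPolicy, role: str, tool: str) -> str:
    """Return 'deny', 'needs_approval', or 'allow' for a single tool call."""
    if role not in policy.allowed_roles or tool not in policy.allowed_tools:
        return "deny"
    if tool in policy.approval_required_for:
        return "needs_approval"
    return "allow"


# Illustrative policy; the role and tool names are made up for this example.
SUPPORT_TRIAGE = WorkflowPolicy(
    name="support-triage",
    allowed_roles=frozenset({"support_agent", "support_lead"}),
    allowed_tools=frozenset({"ticket_search", "kb_lookup", "issue_refund"}),
    data_scopes=frozenset({"tickets", "knowledge_base"}),
    approval_required_for=frozenset({"issue_refund"}),
)

print(check_call(SUPPORT_TRIAGE, "support_agent", "kb_lookup"))
print(check_call(SUPPORT_TRIAGE, "support_agent", "issue_refund"))
print(check_call(SUPPORT_TRIAGE, "sales_rep", "kb_lookup"))
```

The point of the sketch is that once a workflow has a name and a policy, every run can be checked and logged against the same record, which is what makes later comparison possible.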

Without that structure, cutting AI spend is mostly guesswork. Companies end up doing the blunt thing: restricting access, lowering usage limits, or pushing everyone onto cheaper tools. That may reduce the bill, but it also hurts the teams and individuals who were using AI to create real value.

With structure, companies can be more precise. They can see which workflows justify frontier models, which can run on cheaper models, and which should be simple automation instead of agentic work at all.
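One way to picture that precision is a routing rule that looks at a workflow's characteristics rather than at who is asking. The sketch below is illustrative only: the three tiers, the `route` function, and the "low/medium/high" labels are assumptions chosen for the example, and a real routing policy would be tuned against measured quality.

```python
from enum import Enum


class Tier(Enum):
    AUTOMATION = "automation"            # deterministic script, no model call
    FAST_MODEL = "fast_model"            # cheaper, lower-latency model
    REASONING_MODEL = "reasoning_model"  # frontier model for hard cases


LEVELS = ("low", "medium", "high")


def route(needs_judgment: bool, ambiguity: str, stakes: str) -> Tier:
    """Pick a tier for a workflow from three hypothetical attributes."""
    for label in (ambiguity, stakes):
        if label not in LEVELS:
            raise ValueError(f"expected one of {LEVELS}, got {label!r}")
    if not needs_judgment:
        # Work with fixed rules does not need a model at all.
        return Tier.AUTOMATION
    if "high" in (ambiguity, stakes):
        return Tier.REASONING_MODEL
    return Tier.FAST_MODEL


# Example workflows from this article, with attribute values assumed.
print(route(needs_judgment=False, ambiguity="low", stakes="low"))    # calendar update
print(route(needs_judgment=True, ambiguity="low", stakes="low"))     # meeting summary
print(route(needs_judgment=True, ambiguity="high", stakes="high"))   # production debugging
```

A rule this simple will be wrong at the edges, but it can only be written at all once the work has been named and described.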

Figure: Workflow standardization diagram showing scattered AI usage becoming named workflows before routing to fast models, reasoning models, or automation.

Outcomes Need Structure

Outcome-based AI sounds simple until you try to measure it. A company cannot say whether AI helped close a deal, resolve a support issue, ship a feature, or complete an onboarding workflow unless the work itself has some structure.

Otherwise, all the company sees is a trail of prompts, model calls, chat messages, and disconnected tool usage.

The practical work for AI leaders is to define the shape of the work before trying to optimize it. What are the recurring workflows where AI should help? What does success look like for each one? Which systems does the workflow need? Which steps can the agent take alone, and which require review? What should be captured so the company can learn from the run and improve it next time?

That is the missing middle between "everyone has AI access" and "we measure AI outcomes." Most organizations are still closer to the first state. They have bought the tools, but the work has not yet been segmented into repeatable patterns. Until that happens, measuring outcomes is mostly aspirational. There is no stable unit to measure against.

Why Wato Is Well Positioned

This is where Wato fits the shift. Wato helps teams give structure to the work agents do. Instead of treating every AI interaction as a one-off chat, Wato helps teams define reusable workflows, automations, skills, connector access, permission rules, and reviewed memory that agents can use across tools.

That structure matters in an outcome-based world. If a renewal-review workflow always knows which CRM fields to check, which product usage dashboard to use, which support tickets matter, and which output format the team expects, then the company can start to measure whether that workflow worked.

If a support-triage workflow has a defined set of tools, escalation rules, and review requirements, then the company can compare runs over time. If a code-review workflow has known repo access, environment rules, and approval gates, then it becomes easier to understand cost, quality, and risk.

Wato is a precursor to outcome-based AI because it turns scattered AI usage into something that can be repeated, governed, and measured. It helps teams standardize the work first.

Once the work is standardized, companies can ask better questions: which workflows are worth scaling, which should use cheaper models, which require human review, which are blocked by sales or implementation, and where additional AI spend has stopped producing proportional value.

What Companies Will Measure Next

The next phase of AI measurement will look less like API metering and more like operations analytics. Companies will still care about model cost, but only in relation to the work produced.

The useful questions will become things like: what did it cost to qualify a lead, resolve a support issue, prepare a customer report, review a pull request, or complete an onboarding workflow? Which workflows created accepted outputs? Which required human correction? Which model was used, and could a cheaper one have produced the same result?

That is a very different conversation from token spend. It lets finance, engineering, operations, and go-to-market teams discuss AI in the same language.

Instead of asking whether the AI bill is too high in the abstract, they can ask where AI is creating leverage, where the bottleneck has moved, and where more output is becoming work for the rest of the organization to absorb.

This is the shift from measuring AI as usage to managing AI as labor. Tokens still matter, but they are the meter, not the value. The value is the work completed, the customer helped, the lead qualified, the report delivered, the incident diagnosed, or the workflow made repeatable.

The Takeaway

The next winners will not simply be the companies that spend the most on AI. They will be the companies that understand the frontier: where AI spend still creates leverage, where marginal returns flatten, and where the constraint has moved from production to distribution, adoption, trust, or operations.

That requires more than access to good models. It requires standardized workflows, governed tool access, reusable company knowledge, permissions, audit trails, and a way to connect agent activity to business outcomes.

That is what Wato is built for: helping companies turn scattered AI usage into reliable, repeatable, permission-aware outcomes.