I. From Training to Inference
Training produces the model. Everything examined so far — the scaling laws, the data pipelines, the massive clusters running for months — results in a static artifact: a set of weights frozen at a point in time. Training happens once per model generation. Its costs are enormous but bounded. And those costs, as significant as they are, represent only the first claim on compute.
Inference is the second — and it is the larger one. Inference is what happens every time someone uses the model. A prompt goes in, the weights are loaded into accelerator memory, and the network performs a forward pass — matrix multiplications through every layer — to produce a single token. Then it repeats, token by token, until the response is complete. Every question answered, every line of code generated, every agent action taken is an inference event. If training is building the factory, inference is running the assembly line — and the line never stops.
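In code terms, the serving loop is simple to state, even though each pass through it is enormously expensive. The sketch below is purely illustrative: the toy model stands in for billions of parameters, but the structure it shows, one full forward pass per generated token, is the point.

```python
import random

# Toy stand-in for a real model: returns fake "logits" over a tiny vocabulary.
# Purely illustrative; a real forward pass is billions of matrix multiplies.
def toy_forward(tokens):
    return [random.random() for _ in range(100)]

def sample(logits):
    return max(range(len(logits)), key=lambda i: logits[i])  # greedy pick

def generate(prompt_tokens, max_new_tokens=16, eos_id=0):
    tokens = list(prompt_tokens)          # context starts with the user's prompt
    for _ in range(max_new_tokens):
        logits = toy_forward(tokens)      # one full forward pass per token
        next_token = sample(logits)       # choose the next token
        tokens.append(next_token)         # it joins the context for the next pass
        if next_token == eos_id:          # repeat until the response is complete
            break
    return tokens[len(prompt_tokens):]

print(generate([5, 7, 11]))
```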
Training costs scale with how many models you build, but inference costs scale with how many people use them, how often, and how much computation each request demands.
Training is a capex event. Inference is an ongoing operational load that grows with every new user, every new application, and — as this section will demonstrate — every new generation of model architecture. Inference already accounts for the majority of AI workload on hyperscaler infrastructure, and that ratio is widening.
The Escalating Cost of Intelligence
Before examining how inference is consumed, a foundational dynamic must be established: the compute required to generate a single response is itself increasing — rapidly and structurally — as models get smarter.
Model capability is now being pushed forward by scaling computation at inference time — known as test-time compute. The core insight: a model produces substantially better outputs if it "thinks" before answering, generating intermediate reasoning tokens that explore the problem, evaluate approaches, and self-correct before committing to a response.
Chain-of-thought reasoning is the most visible implementation. Rather than mapping directly from prompt to answer, the model generates an extended internal trace — thousands or tens of thousands of tokens — before producing visible output. OpenAI's o-series, Anthropic's extended thinking, and DeepSeek's R1 all implement this. A question that once generated 300 tokens of output now generates 300 visible tokens plus 5,000 to 50,000 reasoning tokens that consume GPU cycles and memory but never appear to the user. Same useful output. An order of magnitude more compute.
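The arithmetic is easy to verify. Using the token ranges above (illustrative, not measured from any particular provider):

```python
# Compute multiplier from hidden reasoning tokens, using the ranges cited above.
visible = 300                                    # tokens the user actually sees
reasoning_low, reasoning_high = 5_000, 50_000    # hidden chain-of-thought tokens

print((visible + reasoning_low) / visible)    # ~17.7x the generation work
print((visible + reasoning_high) / visible)   # ~167.7x at the high end
```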
More advanced techniques push further. Tree-of-thought strategies branch into multiple parallel reasoning paths, evaluate their promise, prune dead ends, and converge on the strongest. The model generates and discards entire sequences that never appear in any output. Verification loops layer on top: after generating an answer, the model re-reads its own output, checks for errors, and revises — running inference on its own inference.
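A stripped-down sketch shows how quickly the calls multiply. The helper functions below are hypothetical stand-ins for real inference calls; the structure, sample several candidates, keep the best, then verify and revise, is what matters.

```python
# Sketch of best-of-n generation with a verification pass.
# `ask_model` and `score` are hypothetical placeholders; in a real system each
# call below would burn thousands of reasoning tokens of its own.

def ask_model(prompt: str) -> str:
    return f"candidate answer to: {prompt[:30]}..."   # placeholder output

def score(answer: str) -> float:
    return len(answer) % 7 / 7.0                      # placeholder quality score

def solve(question: str, n_candidates: int = 4, max_revisions: int = 2) -> str:
    calls = 0
    # Branch: sample several independent reasoning paths and keep the best one.
    candidates = []
    for _ in range(n_candidates):
        candidates.append(ask_model(question))
        calls += 1
    best = max(candidates, key=score)

    # Verify: re-read the chosen answer and revise it. Inference on inference.
    for _ in range(max_revisions):
        best = ask_model(f"Check this answer for errors and fix them:\n{best}")
        calls += 1

    print(f"{calls} model calls for one visible answer")  # 6 calls with the defaults
    return best

solve("What is the marginal cost of one more token?")
```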
II. The Three Modalities of Inference Demand
A. Chat: The Floor
Chat is the most visible form of inference. A user types a prompt, receives a response. One in, one out, at human speed. This drove the initial adoption wave beginning in late 2022.
Chat's compute profile has already shifted. A circa-2023 interaction produced a few hundred tokens via direct generation. A 2025-era interaction with a reasoning model produces those same few hundred visible tokens plus thousands to tens of thousands of reasoning tokens underneath. The user sees the same concise answer. The cluster behind it performed 10–50x the computation. As providers make reasoning the default mode — which competitive pressure demands, because reasoning models produce better answers — the average compute cost per chat interaction rises even for the same users asking the same questions.
Chat remains the baseline rather than the primary growth driver because of its usage pattern: synchronous, human-speed, a handful of exchanges per session. Human reading speed rate-limits consumption. But hundreds of millions of users generating requests that each carry an inflating compute payload still produce enormous aggregate demand. Chat is the floor of inference consumption. It is a floor that is rising with every model generation.
B. API: The Multiplier
The API layer is where inference demand detaches from human speed. When developers integrate model access into applications, three dynamics change simultaneously.
The volume becomes programmatic. A single application can issue millions of inference calls per day — a coding assistant evaluating every pull request, a customer platform routing every message through classification and generation, a search engine rewriting every query. These are persistent, automated streams running at machine speed.
The context lengths expand. API use cases routinely push into long-context territory: entire codebases at 50,000–200,000 tokens, full contracts, complete earnings transcripts. The compute cost of processing context scales aggressively. A 100,000-token input is not merely 100x a 1,000-token input; the attention computation grows quadratically with sequence length, so the true multiple is higher still.
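A rough cost model makes the point. The constants below are illustrative and the model ignores real-world optimizations (FlashAttention, sparse and sliding-window attention, caching), but it captures why the quadratic term bites at long context:

```python
# Rough prefill-cost model: a linear term (MLP and projections) plus a quadratic
# attention term. Constants are illustrative, not any specific model's numbers.

def prefill_flops(n_tokens, n_params=70e9, n_layers=80, d_model=8192):
    linear = 2 * n_params * n_tokens                   # ~2 FLOPs per parameter per token
    attention = 4 * n_layers * d_model * n_tokens**2   # QK^T and attn*V across all layers
    return linear + attention

short = prefill_flops(1_000)
long = prefill_flops(100_000)
print(f"{long / short:.0f}x")   # well above the naive 100x, because of the n^2 term
```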
The calls chain. A RAG pipeline might reformulate a query, retrieve documents via an embedding model, then generate a final answer — a minimum of two LLM passes for one user-visible interaction. Compound pipelines are now standard architecture.
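A minimal sketch of that chaining pattern, with hypothetical placeholder functions rather than any vendor's actual API, looks like this:

```python
# Minimal sketch of a RAG pipeline: one user request, multiple model passes.
# Every function that calls the model is a separate inference event.
# All names below are hypothetical placeholders.

def llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"

def embed(text: str) -> list[float]:
    return [float(ord(c)) for c in text[:8]]       # toy embedding

def retrieve(query_vec: list[float], k: int = 3) -> list[str]:
    return [f"doc_{i}" for i in range(k)]          # toy retrieval

def answer(user_question: str) -> str:
    # Pass 1: reformulate the question into a retrieval query (LLM call).
    search_query = llm(f"Rewrite as a search query: {user_question}")
    # Embedding-model call: smaller, but still inference on an accelerator.
    docs = retrieve(embed(search_query))
    # Pass 2: generate the final answer grounded in the retrieved documents.
    context = "\n".join(docs)
    return llm(f"Answer using these documents:\n{context}\n\nQuestion: {user_question}")

print(answer("Summarize our Q3 contract obligations."))
```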
Smarter models attract harder tasks. Harder tasks consume more compute. The spiral feeds itself.
Developers select the most capable models because capability is product quality. The most capable models are reasoning models — the most compute-intensive per call. A three-call pipeline using a reasoning model is not 3x a single call; it is 3x a call that is itself 10–50x more expensive than the equivalent call eighteen months ago. And as models become more capable, developers delegate harder problems, demanding deeper reasoning and longer thinking traces.
Every SaaS platform, developer tool, and enterprise workflow adding an "AI feature" adds a new persistent stream of inference demand — a stream whose per-request intensity is escalating with each model generation.
C. Agents: The Step Function
Agents represent a qualitative shift in inference economics. An agent is not a request-response pair. It is an autonomous loop: receive a task, reason about it, take an action (call a tool, write code, search the web, read a file), observe the result, reason again about what to do next. This repeats — dozens or hundreds of times — until the task is complete.
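Structurally, the loop is simple. The sketch below uses toy placeholders for the model and the tools, but it shows why token consumption scales with the number of steps rather than with the length of the user's request:

```python
# Skeleton of an agent loop: reason, act, observe, repeat until done.
# `call_model` and `run_tool` are hypothetical placeholders; a real agent would
# route actions to a code runner, browser, file system, and so on.

def call_model(history: list[str]) -> str:
    step = len(history)
    return "DONE" if step >= 6 else f"ACTION: tool_{step}"   # toy policy

def run_tool(action: str) -> str:
    return f"result of {action}"                             # toy tool output

def run_agent(task: str, max_steps: int = 50) -> list[str]:
    history = [f"TASK: {task}"]                 # accumulated context, grows every step
    for _ in range(max_steps):
        decision = call_model(history)          # one full inference call per step
        history.append(decision)
        if decision == "DONE":
            break
        history.append(run_tool(decision))      # the observation feeds the next call
    return history

print(run_agent("Fix the failing integration test"))
```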
The compute implications are multiplicative across every dimension simultaneously.
Token volume per task. Where a chat exchange might consume 1,000–3,000 visible tokens (10,000–50,000 total with reasoning) and an API pipeline might consume 10,000–200,000, a single agent task routinely consumes hundreds of thousands to over 1,000,000 tokens. A coding agent implementing a feature reads files, writes code, runs tests, interprets errors, revises, and iterates — each step a full inference call with a growing context window accumulating the history of every prior action.
Reasoning at every step. This is where the escalating cost of intelligence collides most forcefully with the agent paradigm. Every step in an agent loop is a decision point: evaluate state, consider options, choose an action. These are exactly the problems where extended reasoning produces the largest quality gains. A 30-step task where each step invokes 10,000 tokens of chain-of-thought reasoning generates 300,000 reasoning tokens alone — before counting input context, tool outputs, or visible responses.
Autonomy at machine speed. Agents eliminate the human-speed bottleneck entirely. A user launches a task and walks away. No reading pause, no thinking time, no context switch between steps. The agent consumes compute continuously at machine speed until termination. Sophisticated workflows spawn multiple agents in parallel — one researching, one coding, one reviewing — each independently consuming inference resources at full throughput.
Compounding context pressure. Agents accumulate state. Each observation appends to the running context. By the midpoint of a complex task, the agent operates near maximum context length on every call. Each step is not just one more inference call — it is an increasingly expensive inference call, as growing context increases both prefill compute and KV-cache memory requirements. The cost curve within a single task is not linear. It accelerates.
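A toy model of a single task shows the acceleration. The token counts are assumptions chosen for illustration, not measurements, but the shape of the curve follows directly from the fact that every step re-processes the accumulated history:

```python
# Illustrative cost curve for a single agent task. Each step appends its
# reasoning trace and tool output to the context, so later steps pay to
# re-process everything that came before.

TOKENS_ADDED_PER_STEP = 12_000   # reasoning plus tool output appended each step (assumed)
INITIAL_CONTEXT = 4_000          # task description, system prompt, file listing (assumed)

def task_input_tokens(n_steps):
    total = 0
    context = INITIAL_CONTEXT
    for _ in range(n_steps):
        total += context                 # every step re-reads the full history
        context += TOKENS_ADDED_PER_STEP
    return total

print(task_input_tokens(10))   # ~580k input tokens processed
print(task_input_tokens(30))   # ~5.3M: 3x the steps, roughly 9x the processing
```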
III. Drivers of the Inflection: Why Demand Compounds
The frontier lab leaders identify several compounding drivers that will push inference demand into a new regime. These are not speculative — they are grounded in current product trajectories and disclosed data.
Agents & Autonomous Workloads
Agentic AI is the single largest inference multiplier on the horizon. A human typing a query generates a handful of API calls. An agent performing a multi-step task — booking travel, debugging code, managing a pipeline — generates orders of magnitude more inference. Amazon CEO Andy Jassy noted companies are "just starting to think about deploying AI agents." Microsoft CEO Satya Nadella said 80% of CIOs plan to adopt Copilot within 12 months. Anthropic's economic index found that three-quarters of businesses using Claude do so for "full task delegation." Each autonomous workflow is a persistent, high-volume inference consumer.
Reasoning & Thinking Models
Thinking models — such as OpenAI's o-series, Gemini 2.5 Pro, and Claude's extended thinking — consume dramatically more tokens per query. Analysis showed Gemini 2.5 Flash uses roughly 17× more tokens per request than its predecessor.
Model Capability Overhang
Nadella introduced the concept of "model overhang" — the gap between what models can do and what users are actually using them for. Altman said models have "already saturated the chat use case" but their capabilities extend far beyond chat. As enterprises close this adoption gap, inference demand unlocks in step-function increments. Amodei stated at Morgan Stanley in March 2026 that what labs see internally "is far more crazy than what the outside world perceives" and that in 2026, those capabilities will "spill over into the real world on a large scale."
Price Declines Drive Demand Expansion (Jevons Paradox)
Google lowered Gemini serving unit costs by 78% in 2025. Inference prices across all providers have fallen dramatically. But rather than reducing total spend, cheaper inference unlocks new use cases and higher volumes. Amodei noted that Anthropic is "just beginning to optimize for inference." Every reduction in per-token cost expands the addressable market and increases total tokens consumed. The frontier labs' revenue is growing despite steep price cuts — the volume effect dominates.
The Enterprise Diffusion Curve Is Early
Despite explosive growth, penetration remains low. Nadella described the current moment as "only at the beginning phases of AI diffusion." Altman said 2025 was the year enterprise growth outpaced consumer for the first time, with enterprise now a "major priority" for 2026. Jassy said "the lion's share of that demand is still yet to come" and predicted the middle phase of enterprise adoption "may end up being the largest and the most durable" part of the AI market. Zuckerberg stated that Meta's demand for compute resources "increased even faster than our supply" despite massive GPU buildouts. The S-curve of enterprise adoption is inflecting, not topping.
IV. Demand Signals: Frontier Model Labs
Inference demand is not approaching saturation. It is accelerating toward a massive inflection. The evidence from the companies with the deepest visibility into this market converges on a single conclusion: demand is outrunning supply at every layer of the stack, and the drivers of future demand are compounding, not linear.
This synthesis draws on five interlocking bodies of evidence, each sourced from frontier lab and hyperscaler primary disclosures: (1) token volume growth at unprecedented scale, (2) user and developer adoption trajectories, (3) enterprise spending signals, (4) revenue acceleration that directly proxies inference consumption, and (5) capital expenditure commitments that reflect private demand signals invisible to public markets.
Token Volumes: The Most Direct Measure of Inference
Token throughput is the closest available proxy for aggregate inference demand. Google's disclosures provide the most granular public time series. At Google I/O in May 2025, Sundar Pichai revealed that Google's products and APIs were processing 480 trillion tokens per month — a 50× increase from 9.7 trillion just twelve months prior. By July 2025, that figure had doubled again to 980 trillion. By October 2025, it crossed 1.3 quadrillion tokens per month.
That is a 134× increase in 18 months.
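Taken at face value, those disclosures imply a compounding rate that is easy to back out (a back-of-the-envelope calculation, nothing more):

```python
import math

# Implied compounding rate of Google's disclosed token volumes
# (9.7T/month to 1.3 quadrillion/month over roughly 18 months, per the figures above).
start, end, months = 9.7e12, 1.3e15, 18

monthly_growth = (end / start) ** (1 / months)             # ~1.31x per month
doubling_months = math.log(2) / math.log(monthly_growth)   # ~2.5 months per doubling

print(f"{monthly_growth:.2f}x per month, doubling every {doubling_months:.1f} months")
```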
Google's AI infrastructure lead, Amin Vahdat, told employees internally that the company must double its serving capacity every six months to keep pace with demand. On the API side specifically, by Q3 2025, Pichai disclosed that first-party models like Gemini process 7 billion tokens per minute via direct customer API usage alone.
At OpenAI, Altman reported processing over 6 billion tokens per minute on the API as of October 2025, with 4 million developers building on the platform. OpenAI's API traffic doubled within 48 hours of GPT-5's launch, pushing the platform against its compute capacity limits. Altman stated unambiguously: "Most of what we're building out at this point is the inference."
Inference Cost Declines: The Jevons Paradox Engine
The Data: A 150× Cost Decline in 30 Months
The cost of inference at GPT-4-class quality has fallen at a rate that exceeds almost any precedent in technology — faster than PC compute, faster than bandwidth during the dotcom era, and faster than cloud storage.
At the frontier tier (best available model), costs have also fallen sharply — GPT-5.2 at $1.75/$14 input/output represents roughly a 75% decline from GPT-4's launch pricing, while delivering dramatically superior reasoning. Anthropic's own pricing reflects the same trajectory: Claude Opus 4.5 costs $5/$25 per million tokens, down from Opus 4's $15/$75 — a 67% reduction generation-over-generation.
What the CEOs Say About Cost Dynamics
The leaders closest to inference economics universally frame cost declines as demand accelerants, not threats to revenue:
On inference, we have typically seen more than 2× price-performance gain for every hardware generation, and more than 10× for every model generation due to software optimizations. … When token prices fall, inference computing prices fall; that means people can consume more. And there will be more apps written.
— Satya Nadella, Microsoft Q2 FY2025 earnings / November 2025 interview
We were able to lower Gemini serving unit costs by 78% over 2025 through model optimizations, efficiency and utilization improvements.
— Sundar Pichai, Alphabet Q4 2025 earnings
Yet despite this 78% cost reduction, Google's token throughput grew 134× in 18 months, Cloud AI revenue grew 400% YoY, and Pichai guided 2026 CapEx to $175–185 billion. The volume effect overwhelmed the price effect completely.
Why This Makes Inference Demand Insatiable
The relationship between inference cost and inference demand is not linear — it is hyperelastic. This is Jevons Paradox at industrial scale:
Price drops unlock new use cases. At $60/MTok, only high-value enterprise queries justified API calls. At $0.40/MTok, background agents running continuously become economical. The addressable market expands by orders of magnitude.
Volume overwhelms price. Google cut serving costs 78% but token volume grew 50× in the same period. OpenAI has slashed API prices by 90%+ since GPT-4, yet revenue grew from $1B to $13B+. Anthropic's inference costs exceeded projections by 23% — because demand grew faster than efficiency gains.
The cost floor is nowhere in sight. Nadella says Microsoft sees 10× software optimization per model generation, with hardware adding another 2× per generation. Amodei says optimization is "just beginning." Google just launched tiered inference pricing (Flex at 50% discount for latency-tolerant workloads) — indicating further segmentation ahead.
Cheaper inference creates compound demand. Each price reduction enables a new class of application (agents, background processing, always-on copilots) that generates 10–100× more tokens per user session than a simple chat query. The demand created by cheaper inference dwarfs the demand it replaces.
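Rough arithmetic, using the per-million-token prices cited above and an assumed (not measured) token budget for a heavy agent task, shows why the price floor matters so much:

```python
# Back-of-the-envelope economics of a long-running agent task at two price points.
# Prices per million tokens are the figures cited above; the token count per task
# is an illustrative assumption for a background agent, not a measured value.

TOKENS_PER_TASK = 1_000_000   # a heavy agent task: many steps, long accumulated context

def cost_per_task(price_per_mtok: float) -> float:
    return price_per_mtok * TOKENS_PER_TASK / 1_000_000

print(cost_per_task(60.00))   # $60.00 per task: viable only for high-value work
print(cost_per_task(0.40))    # $0.40 per task: cheap enough to run continuously
```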
User & Developer Adoption: The Demand Surface Is Expanding
Inference demand scales with the number of users and the depth of their engagement. ChatGPT tripled its weekly active users from 300 million (Dec '24) to 900 million (Dec '25). Users send roughly 2.5 billion prompts per day. Meta AI crossed 1 billion monthly active users in Q1 2025 — faster than any AI product in history. The Gemini app reached 650 million monthly active users by Q3 2025, with queries tripling from Q2.
Developer adoption — the leading indicator for enterprise inference — is surging in parallel. Google reported 7 million developers building with Gemini (5× YoY), with Vertex AI usage up 40×. OpenAI counts 4 million developers. Anthropic's Claude Code hit $2.5 billion in annualized revenue by February 2026. Dario Amodei described coding as the "strongest leading indicator of the AI capabilities explosion." Every developer building an AI-powered application creates a persistent inference workload that scales with their user base.
Enterprise Adoption: The Inference Multiplier
Enterprise and developer usage is the critical vector for inference inflection. Unlike consumer chatbot interactions, enterprise workloads are persistent, high-volume, and growing in complexity. The data from frontier labs and hyperscalers is unambiguous: enterprise adoption is accelerating, not plateauing.
Frontier Lab Enterprise Signals
At OpenAI, enterprise seats grew 9× year-over-year, weekly enterprise messages increased 8×, and the company now counts over 3 million paying business users and 1 million business customers. Usage of Custom GPTs and Projects increased 19× in 2025. Altman stated in late 2025 that the API business grew faster than ChatGPT consumer — a structural shift toward inference-heavy developer and enterprise workloads.
At Anthropic, where roughly 80% of revenue comes from businesses via the API, annualized revenue grew from $1 billion in January 2025 to $19 billion by March 2026 — overwhelmingly driven by enterprise API consumption. The number of customers spending $100K+ annually grew 7× in a year; over 500 now spend $1M+ annually. Eight of the Fortune 10 are Claude customers. Amodei described Anthropic as "the fastest growing software company in history at the scale that it's at."
Hyperscaler Enterprise Signals
Google Cloud revenue grew 48% in Q4 2025 to a $70B+ run rate, with backlog surging 55% quarter-over-quarter to $240 billion. Revenue from products built on Google's generative AI models grew nearly 400% year-over-year in Q4. In December 2025 alone, nearly 350 Google Cloud customers each processed more than 100 billion tokens. Pichai reduced Gemini serving unit costs by 78% in 2025, yet total demand still overwhelmed capacity.
AWS grew 24% in Q4 to $35.6 billion in quarterly revenue ($142B run rate), with Jassy describing AI capacity as being monetized "as fast as we can install it." Trainium chips represent a $10B+ run-rate business with over 100,000 companies using them. Microsoft reported Azure growth of 39%, with CFO Amy Hood noting that customer demand "continues to exceed our supply" despite $37.5 billion in quarterly CapEx. 80% of CIOs surveyed plan to adopt Copilot within 12 months.
V. Revenue Trajectories: Inference Monetization at Scale
Revenue at frontier labs is the most direct monetization proxy for inference consumption. The trajectories are extraordinary.
OpenAI grew from $1B (2023) to $3.7B (2024) to an estimated $13B+ (2025), with Altman suggesting $100B revenue by 2027 on the BG2 Podcast. Anthropic grew revenue roughly 10× per year: zero to $100M (2023), $1B (2024), and $19B ARR by March 2026 — a trajectory Sacra estimates at 1,167% year-over-year growth.
Critically, most of Anthropic's revenue comes from API inference — businesses paying per-token for model outputs. This makes Anthropic's revenue curve nearly a pure proxy for enterprise inference demand growth. The company's 8- and 9-figure enterprise deals tripled in 2025 versus 2024, and average business customer spend grew 5×.
The convergence of evidence is striking in its unanimity. Every frontier lab CEO — Altman, Amodei, Pichai — reports that demand exceeds supply and that the current constraint is infrastructure, not willingness to pay. Every hyperscaler CEO — Nadella, Jassy, Zuckerberg — is doubling capital expenditure based on demand signals they describe as unprecedented.
Altman says he could double revenue overnight with double the compute. Jassy says capacity is monetized "as fast as we can install it." Google must double serving capacity every six months. The drivers ahead — agents, reasoning models, enterprise diffusion, price-driven demand expansion — are compounding, not additive.
Dario Amodei's analogy of rice on a chessboard captures it precisely: we are on roughly the 40th square, and the shocks from the first 39 squares combined are a fraction of what's ahead. Inference demand is not merely growing. It is approaching an inflection that the companies closest to it believe will reshape the economy.
VI. Demand Signals: Enterprise
The BCG CEO Data Point analysis of 6,027 company earnings calls in Q4 2025 provides a striking demand-side signal. Across the entire corporate landscape — not just tech — "AI" is the single most frequently mentioned keyword cluster and is still growing quarter-over-quarter. More telling are the high-growth, emerging clusters.
Industry-Specific Value Levers
There is a reason enterprise adoption of AI is accelerating despite the complexity and cost involved. The reason is not enthusiasm. It is not trend-following. It is the specific, quantifiable economics of operational improvement — the math that business unit owners, plant managers, and P&L leaders run every day.
Enterprises do not operate in abstractions. They operate inside very specific economic machines. Each machine has levers. Those levers have measurable output. And the operators who manage those machines know, with precision, what moving a lever is worth.
Every industry has a physics variable. A small delta produces nonlinear financial outcomes. When AI moves that variable — even slightly — the adoption decision is immediate.
And critically: every one of these deployments is a persistent inference workload. Visa's fraud detection runs on every transaction, 24/7/365. McDonald's AI processes orders during every shift. Klarna's assistant handles millions of conversations monthly. Salesforce's agents execute 3 billion workflows per month. None of this is a one-time computation. It is continuous, scaling inference consumption that persists for as long as the improvement persists — which is to say, indefinitely.
This is why enterprise AI adoption is not a matter of if. It is a matter of how fast. The economic incentive is too direct and too large for operators to ignore.
VII. Enterprise Consulting Signals
There is a category of market participant whose role in this transition is systematically underappreciated: the management consulting industry. Every major enterprise technology adoption in the last three decades — ERP, cloud, mobile — has followed an identical pattern. The technology emerges, early adopters demonstrate results, and then the consulting firms arrive to systematize the transition for the rest of the market. McKinsey, BCG, Deloitte, Accenture, and PwC do not merely measure enterprise adoption. They cause it.
They build the frameworks, staff the transformation offices, publish the benchmarks, and — critically — create the competitive pressure that forces laggards to move. Their involvement is itself a structural signal: when these firms are publishing data showing that 88% of organizations now use AI in at least one business function, that the share using it across five or more functions has grown 5× in four years, and that "future-built" companies are capturing 5.3× the revenue impact and 3.0× the cost reduction of laggards — those findings do not simply describe the market. They reshape it.
McKinsey's global survey of nearly 2,000 organizations shows adoption rising from 20% in 2017 to 88% in 2025, with gen AI usage alone surging from 33% to 79% in just two years — and critically, it is not shallow adoption. The share of companies deploying AI across three or more functions tripled from 17% to 51% since 2021, meaning inference workloads are multiplying within organizations, not just across them.
BCG's Build for the Future study of 1,250 companies provides the ROI proof that sustains the flywheel: companies at the frontier of AI maturity deliver 1.7× the revenue growth, 3.6× the three-year total shareholder return, and 2.7× the return on invested capital — and they are reinvesting the gains, spending 120% more on AI than laggards, creating a compounding advantage that widens with each cycle.
Perhaps most significant for inference demand, BCG finds that the share of AI-driven value from agentic systems is expected to nearly double by 2028, with 46% of companies already experimenting with agents and 30% allocating more than 15% of their AI budgets to agentic workloads. Each agent deployed is not a single inference call — it is a persistent, autonomous workload that consumes tokens continuously.
VIII. The Enterprise Workload Opportunity
Andy Jassy, Amazon's CEO, offered a framing on the company's fourth quarter 2025 earnings call that deserves close attention. He described the current AI compute market as a barbell. On one end sit the frontier AI labs — OpenAI, Anthropic, Google DeepMind, xAI, Meta AI — spending enormous sums on training and running the largest models. On the other end sit the productivity and cost-avoidance workloads that enterprises are already extracting value from today: customer service automation, business process automation, fraud detection, document summarization. These are the use cases that justify the current wave of spending.
But the middle of the barbell, Jassy said, is where the real opportunity lies. That middle consists of all the enterprise production workloads — the millions of custom, internal, and legacy applications that actually run the world's businesses. Jassy was explicit: this middle portion "very well may end up being the largest and the most durable" source of AI compute demand.
— Andy Jassy, Amazon Q4 2025 earnings
This is the single most important demand signal in the entire AI infrastructure thesis, and it is the one that receives the least attention.
The Invisible Majority
To understand why the enterprise workload opportunity is so large, you first have to understand what enterprise applications actually are — and how many of them exist.
When most people think of enterprise software, they think of the names they recognize: Salesforce, SAP, Oracle ERP, Workday, ServiceNow. These are the core commercial platforms — the ERP systems, CRM platforms, and HR suites that form the operational backbone of large organizations. They are important, well-understood, and represent the visible tip of the enterprise software stack. But they account for a remarkably small fraction of the total.