- Training produces a static artifact: a set of weights frozen at a point in time. It happens once per model generation.
- Inference runs every time someone uses the model — a forward pass through every layer, token by token, for every question answered, every line of code generated, every agent action taken.
- Inference already accounts for the majority of AI workload on hyperscaler infrastructure, and the ratio is widening.
- Altman stated: “Most of what we’re building out at this point is the inference.”
The asymmetry is structural, not incidental. Training costs scale with how many models you build — a finite, episodic number. Inference costs scale with how many people use them, how often, and how much computation each request demands. These are continuous, unbounded variables. As the number of users, applications, and per-query compute all grow simultaneously, inference becomes the dominant and ever-increasing claim on the world’s AI compute infrastructure. Training is the factory construction. Inference is the assembly line that never stops.
- Chain-of-thought reasoning: a question that once generated 300 tokens now generates 300 visible tokens plus 5,000 to 50,000 reasoning tokens that consume GPU cycles but never appear to the user. Same useful output, one to two orders of magnitude more compute.
- Tree-of-thought strategies branch into multiple parallel reasoning paths, generating and discarding entire sequences. Verification loops run inference on inference.
- Gemini 2.5 Flash uses roughly 17× more tokens per request than its predecessor.
- The research trajectory points explicitly toward architectures that trade more inference compute for better results. Models get smarter because they consume more compute at inference time.
This is not a one-time step change. It is a continuous escalation baked into the direction of AI research. Every improvement in model capability is being achieved by scaling computation at inference time — test-time compute. The models are not getting cheaper to run as they get smarter; they are getting more expensive per query because they are getting smarter. This escalating per-query cost acts as a hidden multiplier on every other demand driver: every new user, every new application, every new modality of consumption runs on a treadmill that is itself accelerating.
- ChatGPT tripled weekly active users from 300 million (Dec ’24) to 900 million (Dec ’25). Users send roughly 2.5 billion prompts per day.
- Meta AI crossed 1 billion monthly active users in Q1 2025 — faster than any AI product in history.
- Gemini app reached 650 million monthly active users by Q3 2025, with queries tripling from Q2.
- A 2025-era chat interaction with a reasoning model performs 10–50× the computation of a 2023-era interaction, due to chain-of-thought reasoning becoming the default mode.
Chat’s growth potential as an inference driver is often underestimated because it is constrained by human speed — synchronous, a handful of exchanges per session. But the constraint applies to frequency, not to intensity. The compute cost per interaction is rising with each model generation as reasoning becomes the default, and the user base is measured in hundreds of millions. Even at human pace, hundreds of millions of users generating requests that each carry an inflating compute payload produce enormous aggregate demand. Chat is not the growth story — but it is the floor beneath every growth story, and that floor rises automatically with each model upgrade.
- A single application can issue millions of inference calls per day — coding assistants evaluating every pull request, customer platforms routing every message, search engines rewriting every query.
- API use cases routinely push into long-context territory: entire codebases at 50,000–200,000 tokens, full contracts, complete earnings transcripts. A 100,000-token input demands more than 100× the compute of a 1,000-token input, because attention cost grows super-linearly with context length.
- Compound pipelines are now standard: a RAG pipeline requires a minimum of two LLM passes for one user-visible interaction. A three-call pipeline using a reasoning model is 3× a call that is itself 10–50× more expensive than eighteen months ago — 30–150× the compute of the old baseline.
- Google reported 7 million developers building with Gemini (5× YoY), Vertex AI usage up 40×. OpenAI counts 4 million developers.
Three dynamics change simultaneously when developers integrate model access: volume becomes programmatic (millions of calls per day), context lengths expand (scaling compute super-linearly), and calls chain (multiplying per-interaction cost). Developers select the most capable models because capability is product quality, and the most capable models are the most compute-intensive. Smarter models attract harder tasks. Harder tasks consume more compute. Every SaaS platform, developer tool, and enterprise workflow adding an “AI feature” adds a new persistent stream of inference demand whose per-request intensity escalates with each model generation.
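The super-linear context claim can be made concrete with a simplified FLOPs model for a dense transformer forward pass. The model size and layer dimensions below are illustrative assumptions, not any vendor's disclosed architecture; the point is only the shape of the curve.

```python
# Rough forward-pass FLOPs for processing a prompt with a dense
# transformer. Parameter count and dimensions are assumed for
# illustration only.
def prompt_flops(tokens, params=70e9, n_layers=80, d_model=8192):
    # Dense (MLP/projection) layers: ~2 FLOPs per parameter per token.
    dense = 2 * params * tokens
    # Attention scores and weighted sums scale with the square of
    # sequence length: ~4 * n_layers * d_model * tokens^2 FLOPs.
    attention = 4 * n_layers * d_model * tokens ** 2
    return dense + attention

short = prompt_flops(1_000)
long = prompt_flops(100_000)
print(f"100k-token prompt costs {long / short:.0f}x a 1k-token prompt")
```

Under these assumptions the 100× longer prompt costs roughly 280× the compute, because the quadratic attention term dominates at long context.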
- A single agent task routinely consumes hundreds of thousands to over 1,000,000 tokens. A coding agent reads files, writes code, runs tests, interprets errors, and iterates — each step a full inference call with growing context.
- Every step in an agent loop is a decision point where extended reasoning produces the largest quality gains. A 30-step task with 10,000 tokens of chain-of-thought per step generates 300,000 reasoning tokens alone.
- Agents eliminate the human-speed bottleneck entirely. Sophisticated workflows spawn multiple agents in parallel, each consuming inference at full throughput.
- Compounding context pressure: by mid-task, the agent operates near maximum context length on every call. The cost curve within a single task accelerates, not just grows linearly.
- Jassy noted companies are “just starting to think about deploying AI agents.” Anthropic’s economic index: three-quarters of businesses using Claude do so for “full task delegation.”
The agent paradigm is multiplicative across every dimension simultaneously: token volume per task (100–1,000× chat), reasoning at every step (compounding the most compute-intensive form of inference), machine speed (no human pauses), and escalating context (each step more expensive than the last). The arithmetic is explicit: a 2023 chat consumed X tokens; a 2025 chat with reasoning consumes 5–30X; an API pipeline consumes 10–100X of the new baseline; a single agent task consumes 100–1,000X of that inflated baseline. Agents do not merely add to inference demand. They multiply it by orders of magnitude — and their deployment at scale is just beginning.
- The cost of GPT-4-class inference has fallen ~150× in 30 months — faster than PC compute, bandwidth during the dotcom era, or cloud storage.
- Google cut Gemini serving unit costs by 78% in 2025, yet token throughput grew 134× in 18 months, Cloud AI revenue grew 400% YoY, and 2026 CapEx was guided to $175–185B.
- OpenAI slashed API prices by 90%+ since GPT-4, yet revenue grew from $1B to $13B+. Anthropic’s inference costs exceeded projections by 23% because demand grew faster than efficiency gains.
- Nadella: 10× software optimization per model generation plus 2× per hardware generation. Amodei: optimization is “just beginning.”
- At $60/MTok, only high-value queries justified API calls. At $0.40/MTok, background agents running continuously become economical. The addressable market expands by orders of magnitude with each price reduction.
The skeptical view of cost declines is that they reduce total spend and shrink the infrastructure opportunity. The empirical record says the opposite. Every price reduction unlocks a new class of application — agents, background processing, always-on copilots — that generates 10–100× more tokens per user session than the workload it supplements. The volume effect overwhelms the price effect completely. Google’s case is dispositive: 78% cost reduction, 134× volume growth, 400% revenue growth, $175B+ in forward capex. The demand created by cheaper inference dwarfs the demand it replaces. Cost declines do not shrink the market; they detonate it.
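The Google case above reduces to one line of arithmetic: total serving cost is unit cost times volume, so a 78% unit-cost cut combined with 134× volume growth still yields a large net increase in total compute consumed.

```python
# Net effect of Google's disclosed figures: 78% unit-cost reduction
# against 134x token volume growth over the same period.
unit_cost_factor = 1 - 0.78   # cost per token after the reduction
volume_factor = 134           # token volume multiplier
total_cost_factor = unit_cost_factor * volume_factor
print(f"Total serving cost still grew ~{total_cost_factor:.0f}x")
```

The volume effect (134×) overwhelms the price effect (0.22×) by a wide margin: total serving cost grows roughly 29×, which is why the cost decline coincides with record capex rather than shrinking spend.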
- Google: 9.7 trillion tokens/month (May 2024) → 480 trillion (May 2025) → 980 trillion (July 2025) → 1.3 quadrillion (October 2025). A 134× increase in 18 months.
- Google’s AI infrastructure lead stated internally the company must double serving capacity every six months.
- OpenAI: processing over 6 billion tokens per minute on the API as of October 2025, with 4 million developers. API traffic doubled within 48 hours of GPT-5’s launch.
- Google processes 7 billion tokens per minute via direct customer API usage alone by Q3 2025.
Token throughput is the closest available proxy for aggregate inference demand because it directly measures computation performed. The numbers are not projections or forecasts — they are disclosed actuals from the companies with the largest visibility into AI compute consumption. A 134× increase in 18 months, with the rate of growth itself accelerating (doubling from May to July, then continuing), indicates the demand curve is not flattening. The fact that Google must double serving capacity every six months — and is still falling behind — confirms that supply is not catching up to demand despite record-breaking infrastructure expansion.
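The "doubling every six months and still falling behind" claim follows directly from the disclosed figures. The doubling time below is derived from the stated 134×-in-18-months growth, not separately disclosed.

```python
import math

# Implied demand doubling time from the disclosed token figures:
# 9.7T tokens/month (May 2024) -> 1.3 quadrillion (Oct 2025), ~134x
# growth, treated per the text as an 18-month period.
growth = 1.3e15 / 9.7e12       # ~134x
months = 18
doublings = math.log2(growth)  # number of doublings in the period
doubling_time = months / doublings
print(f"{doublings:.1f} doublings, one every {doubling_time:.1f} months")
```

Demand implied by these numbers doubles roughly every 2.5 months. A supplier doubling capacity every six months against demand doubling every 2.5 falls further behind with every cycle, which is exactly what the internal capacity directive describes.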
- ChatGPT: 300M WAU (Dec ’24) → 900M WAU (Dec ’25). 2.5 billion prompts per day.
- Meta AI: 1 billion MAU in Q1 2025 — faster than any AI product in history.
- Gemini app: 650 million MAU by Q3 2025, queries tripling from Q2.
- Developer adoption: Google 7M developers on Gemini (5× YoY), Vertex AI usage up 40×. OpenAI 4M developers. Anthropic Claude Code $2.5B ARR by February 2026.
- Amodei described coding as the “strongest leading indicator of the AI capabilities explosion.”
Inference demand scales with two variables: the number of endpoints generating requests, and the depth of each endpoint’s engagement. Consumer adoption provides breadth (billions of users). Developer adoption provides depth — and crucially, persistence. Every developer building an AI-powered application creates a workload that runs continuously and scales with their own user base, not just with their personal usage. This is the demand surface expanding in two dimensions simultaneously: more humans using AI directly, and more software systems embedding AI as infrastructure. The developer metric is the leading indicator because each developer-integrated application becomes a permanent, scaling inference consumer.
- OpenAI: enterprise seats grew 9× YoY, weekly enterprise messages increased 8×, 3M+ paying business users. API business grew faster than ChatGPT consumer.
- Anthropic: ~80% of revenue from business API. ARR from $1B (Jan 2025) to $19B (Mar 2026). 500+ customers at $1M+ annually. Eight of Fortune 10 are Claude customers.
- Google Cloud: revenue grew 48% Q4 2025, backlog surged 55% QoQ to $240B. AI-model-driven revenue grew ~400% YoY. In Dec 2025 alone, ~350 customers each processed 100B+ tokens.
- AWS: $142B run rate, AI capacity monetized “as fast as we can install it.” Microsoft Azure: 39% growth, demand “continues to exceed supply.” 80% of CIOs plan to adopt Copilot within 12 months.
- Nadella: “only at the beginning phases of AI diffusion.” Jassy: “the lion’s share of that demand is still yet to come.” Amodei: capabilities will “spill over into the real world on a large scale” in 2026.
Enterprise workloads differ from consumer chat in three critical ways: they are persistent (running 24/7/365), high-volume (millions of transactions per day), and growing in complexity (from simple classification to multi-step agent workflows). The convergence of signals is striking — every frontier lab reports enterprise as the fastest-growing segment; every hyperscaler reports demand exceeding supply; every CEO frames the current moment as early-innings. The specific metric of 350 Google Cloud customers each processing 100 billion tokens in a single month demonstrates that enterprise inference is already operating at scale that dwarfs consumer usage per customer. And the leaders closest to the demand unanimously state the S-curve is inflecting, not topping.
- OpenAI: $1B (2023) → $3.7B (2024) → ~$13B+ (2025). Altman has suggested $100B revenue by 2027.
- Anthropic: revenue growing ~10× yearly. Reaching $19B ARR by March 2026 — estimated 1,167% YoY growth. Most revenue from API inference.
- Altman: OpenAI is profitable on inference and would be profitable overall if not investing in training next-generation models.
- Nadella confirmed OpenAI has “beaten every business plan projection” Microsoft has seen.
- Every frontier lab CEO reports the same: demand exceeds supply and the constraint is infrastructure, not willingness to pay. Altman says he could double revenue overnight with double the compute.
Revenue is the hardest signal available. It is not survey data, adoption forecasts, or management optimism — it is customers paying money for inference compute. Anthropic’s case is particularly telling: with ~80% of revenue from API inference, its revenue curve is nearly a pure proxy for enterprise inference demand growth. The fact that these companies are profitable on inference (not subsidizing usage) and still supply-constrained (not demand-constrained) means the revenue figures understate the true demand. If Altman could double revenue with double the compute, then the observable revenue is bounded by supply, not by willingness to pay. Amodei’s analogy of rice on a chessboard captures it: we are on the 40th square, and the shocks from the first 39 combined are a fraction of what’s ahead.
- Jassy described the AI compute market as a barbell: frontier labs on one end, productivity workloads on the other. The middle — enterprise production workloads — “very well may end up being the largest and the most durable.”
- A Fortune 500 company may run 1,000 to 10,000+ custom internal tools, hundreds of databases, tens of thousands of scripts. Each represents accumulated business logic that exists nowhere else.
- Enterprise applications are deeply interconnected: one action cascades through four or more systems. AI embedded in one application accelerates an entire chain — the amplification dynamic.
- Proprietary data creates inference workloads no public model can replicate. Amazon launched Nova Forge for injecting proprietary data into model pretraining.
- Visa’s fraud detection runs on every transaction 24/7. Salesforce agents execute 3 billion workflows/month. Every deployment is a persistent inference workload that runs indefinitely.
The enterprise workload opportunity is structurally different from the consumer and API stories in three ways. First, the scale: millions of private applications, each a candidate for AI integration. Second, the amplification: interconnected systems mean the compute demand is proportional not to the number of applications but to the number of connections between them — a nonlinear relationship in large enterprises. Third, the durability: proprietary data creates unique value that makes each workload irreplaceable and permanent. These are not discretionary features that can be turned off in a downturn. They are operational improvements embedded in the way the enterprise runs — persistent, 24/7 inference consumers that exist for as long as the enterprise itself operates.
- The AI coding agent market grew from ~$500M run-rate (end of 2024) to $5–6B by Q4 2025 — 10× in a single year.
- Claude Code: $0 to $2.5B ARR in nine months. 4% of all public GitHub commits authored by Claude Code, projections exceeding 20% by year-end. Business subscriptions quadrupled since January 2026.
- OpenAI Codex: 2M+ WAU by March 2026, usage up 5× since January. GitHub Copilot: 4.7M paid subscribers. Google Antigravity: 1.5M WAU.
- One Anthropic engineer shipped 300+ pull requests in a month running five parallel Claude Code agents — the output of an entire small team. Meta CFO: output per engineer rose 30%, power users up 80%.
- 84% of developers now use AI tools in workflow, 51% daily. Codex sessions run autonomously for 7+ hours on a single task.
- 28 million developers worldwide, developer compensation 50–70% of enterprise IT budgets. Enterprise IT departments carry 12–24 month backlogs.
The productivity framing dramatically understates what is happening. AI coding tools create three distinct sources of inference demand. First, the direct demand: every line of code written, reviewed, or analyzed by an AI model consumes inference compute continuously, across millions of developers, indefinitely. Second, the recursive demand: AI makes developers faster, faster developers build more AI applications, those applications generate inference workloads, and the cycle accelerates. Third, the strategic demand: when developers use AI on production codebases, the models gain visibility into how enterprises actually operate — the workflows, data schemas, and business logic encoded in millions of lines of code. This is not a feature. It is a channel into the operational core of every enterprise on Earth. One Google principal engineer acknowledged that Claude Code reproduced a year of architectural work in one hour. The revenue trajectory ($500M to $5–6B in one year) confirms the market recognizes this.
- McKinsey global survey (~2,000 organizations): AI adoption rose from 20% (2017) to 88% (2025). Gen AI usage surged from 33% to 79% in two years. Share of companies deploying AI across 3+ functions tripled from 17% to 51% since 2021.
- BCG Build for the Future study (1,250 companies): frontier AI-maturity companies deliver 1.7× revenue growth, 3.6× three-year TSR, 2.7× ROIC. They spend 120% more on AI than laggards.
- BCG: share of AI-driven value from agentic systems expected to nearly double by 2028. 46% of companies already experimenting with agents, 30% allocating >15% of AI budgets to agentic workloads.
- BCG CEO data point analysis of 6,027 earnings calls: “AI Agents,” “Agentic AI,” and “AI Tools” are the fastest-growing keyword clusters. “AI Infrastructure,” “Hyperscaler,” “GPU,” “Data Centers” all in high-growth quadrant.
Every major enterprise technology adoption in the last three decades — ERP, cloud, mobile — followed the same pattern: technology emerges, early adopters demonstrate results, consulting firms systematize the transition for the rest of the market. The consulting ecosystem does not merely measure adoption. It causes it. When McKinsey publishes that 88% of organizations use AI and BCG shows frontier adopters capturing 3.6× the shareholder return, a CEO who reads those findings does not have the option of waiting. The consulting flywheel is now spinning: measure adoption, publish findings, create urgency, staff implementations, measure ROI, publish again. Each rotation converts more enterprise workloads from evaluation into production — and from production into the persistent inference infrastructure that sustains them. The keyword analysis of 6,027 earnings calls confirms this is no longer abstract: CEOs are discussing agents, infrastructure, and GPUs in specific, deployment-oriented terms.