- Training produces a static artifact: a set of weights frozen at a point in time. It happens once per model generation.
- Inference runs every time someone uses the model — a forward pass through every layer, token by token, for every question answered, every line of code generated, every agent action taken.
- Inference already accounts for the majority of AI workload on hyperscaler infrastructure, and the ratio is widening.
- Altman stated: “Most of what we’re building out at this point is the inference.”
The asymmetry is structural, not incidental. Training costs scale with how many models you build — a finite, episodic number. Inference costs scale with how many people use them, how often, and how much computation each request demands. These are continuous, unbounded variables. As the number of users, applications, and per-query compute all grow simultaneously, inference becomes the dominant and ever-increasing claim on the world’s AI compute infrastructure. Training is the factory construction. Inference is the assembly line that never stops.
- Chain-of-thought reasoning: a question that once generated 300 tokens now generates 300 visible tokens plus 5,000 to 50,000 reasoning tokens that consume GPU cycles but never appear to the user. Same useful output, one to two orders of magnitude more compute.
- Tree-of-thought strategies branch into multiple parallel reasoning paths, generating and discarding entire sequences. Verification loops run inference on inference.
- Gemini 2.5 Flash uses roughly 17× more tokens per request than its predecessor.
- The research trajectory points explicitly toward architectures that trade more inference compute for better results. Models get smarter because they consume more compute at inference time.
This is not a one-time step change. It is a continuous escalation baked into the direction of AI research. Every improvement in model capability is being achieved by scaling computation at inference time — test-time compute. The models are not getting cheaper to run as they get smarter; they are getting more expensive per query because they are getting smarter. This escalating per-query cost acts as a hidden multiplier on every other demand driver: every new user, every new application, every new modality of consumption runs on a treadmill that is itself accelerating.
- ChatGPT tripled weekly active users from 300 million (Dec ’24) to 900 million (Dec ’25). Users send roughly 2.5 billion prompts per day.
- Meta AI crossed 1 billion monthly active users in Q1 2025 — faster than any AI product in history.
- Gemini app reached 650 million monthly active users by Q3 2025, with queries tripling from Q2.
- A 2025-era chat interaction with a reasoning model performs 10–50× the computation of a 2023-era interaction, due to chain-of-thought reasoning becoming the default mode.
Chat’s growth potential as an inference driver is often underestimated because it is constrained by human speed — synchronous, a handful of exchanges per session. But the constraint applies to frequency, not to intensity. The compute cost per interaction is rising with each model generation as reasoning becomes the default, and the user base is measured in hundreds of millions. Even at human pace, hundreds of millions of users generating requests that each carry an inflating compute payload produce enormous aggregate demand. Chat is not the growth story — but it is the floor beneath every growth story, and that floor rises automatically with each model upgrade.
- A single application can issue millions of inference calls per day — coding assistants evaluating every pull request, customer platforms routing every message, search engines rewriting every query.
- API use cases routinely push into long-context territory: entire codebases at 50,000–200,000 tokens, full contracts, complete earnings transcripts. A 100,000-token input demands more than 100× the compute of a 1,000-token input, because attention cost grows super-linearly with context length.
- Compound pipelines are now standard: a RAG pipeline requires a minimum of two LLM passes for one user-visible interaction. A three-call pipeline using a reasoning model is 3× a call that is itself 10–50× more expensive than eighteen months ago — 30–150× the compute of the old baseline.
- Google reported 7 million developers building with Gemini (5× YoY), Vertex AI usage up 40×. OpenAI counts 4 million developers.
Three dynamics change simultaneously when developers integrate model access: volume becomes programmatic (millions of calls per day), context lengths expand (scaling compute super-linearly), and calls chain (multiplying per-interaction cost). Developers select the most capable models because capability is product quality, and the most capable models are the most compute-intensive. Smarter models attract harder tasks. Harder tasks consume more compute. Every SaaS platform, developer tool, and enterprise workflow adding an “AI feature” adds a new persistent stream of inference demand whose per-request intensity escalates with each model generation.
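The super-linear context claim can be made concrete with a simplified FLOPs model for a dense transformer forward pass. The model size and layer dimensions below are illustrative assumptions, not any vendor's disclosed architecture; the point is only the shape of the curve.

```python
# Rough forward-pass FLOPs for processing a prompt with a dense
# transformer. Parameter count and dimensions are assumed for
# illustration only.
def prompt_flops(tokens, params=70e9, n_layers=80, d_model=8192):
    # Dense (MLP/projection) layers: ~2 FLOPs per parameter per token.
    dense = 2 * params * tokens
    # Attention scores and weighted sums scale with the square of
    # sequence length: ~4 * n_layers * d_model * tokens^2 FLOPs.
    attention = 4 * n_layers * d_model * tokens ** 2
    return dense + attention

short = prompt_flops(1_000)
long = prompt_flops(100_000)
print(f"100k-token prompt costs {long / short:.0f}x a 1k-token prompt")
```

Under these assumptions the 100× longer prompt costs roughly 280× the compute, because the quadratic attention term dominates at long context.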
- A single agent task routinely consumes hundreds of thousands to over 1,000,000 tokens. A coding agent reads files, writes code, runs tests, interprets errors, and iterates — each step a full inference call with growing context.
- Every step in an agent loop is a decision point where extended reasoning produces the largest quality gains. A 30-step task with 10,000 tokens of chain-of-thought per step generates 300,000 reasoning tokens alone.
- Agents eliminate the human-speed bottleneck entirely. Sophisticated workflows spawn multiple agents in parallel, each consuming inference at full throughput.
- Compounding context pressure: by mid-task, the agent operates near maximum context length on every call. The cost curve within a single task accelerates, not just grows linearly.
- Jassy noted companies are “just starting to think about deploying AI agents.” Anthropic’s economic index: three-quarters of businesses using Claude do so for “full task delegation.”
The agent paradigm is multiplicative across every dimension simultaneously: token volume per task (100–1,000× chat), reasoning at every step (compounding the most compute-intensive form of inference), machine speed (no human pauses), and escalating context (each step more expensive than the last). The arithmetic is explicit: a 2023 chat consumed X tokens; a 2025 chat with reasoning consumes 5–30X; an API pipeline consumes 10–100X of the new baseline; a single agent task consumes 100–1,000X of that inflated baseline. Agents do not merely add to inference demand. They multiply it by orders of magnitude — and their deployment at scale is just beginning.
- The cost of GPT-4-class inference has fallen ~150× in 30 months — faster than PC compute, bandwidth during the dotcom era, or cloud storage.
- Google cut Gemini serving unit costs by 78% in 2025, yet token throughput grew 134× in 18 months, Cloud AI revenue grew 400% YoY, and 2026 CapEx was guided to $175–185B.
- OpenAI slashed API prices by 90%+ since GPT-4, yet revenue grew from $1B to $13B+. Anthropic’s inference costs exceeded projections by 23% because demand grew faster than efficiency gains.
- Nadella: 10× software optimization per model generation plus 2× per hardware generation. Amodei: optimization is “just beginning.”
- At $60/MTok, only high-value queries justified API calls. At $0.40/MTok, background agents running continuously become economical. The addressable market expands by orders of magnitude with each price reduction.
The skeptical view of cost declines is that they reduce total spend and shrink the infrastructure opportunity. The empirical record says the opposite. Every price reduction unlocks a new class of application — agents, background processing, always-on copilots — that generates 10–100× more tokens per user session than the workload it supplements. The volume effect overwhelms the price effect completely. Google’s case is dispositive: 78% cost reduction, 134× volume growth, 400% revenue growth, $175B+ in forward capex. The demand created by cheaper inference dwarfs the demand it replaces. Cost declines do not shrink the market; they detonate it.
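The Google case above reduces to one line of arithmetic: total serving cost is unit cost times volume, so a 78% unit-cost cut combined with 134× volume growth still yields a large net increase in total compute consumed.

```python
# Net effect of Google's disclosed figures: 78% unit-cost reduction
# against 134x token volume growth over the same period.
unit_cost_factor = 1 - 0.78   # cost per token after the reduction
volume_factor = 134           # token volume multiplier
total_cost_factor = unit_cost_factor * volume_factor
print(f"Total serving cost still grew ~{total_cost_factor:.0f}x")
```

The volume effect (134×) overwhelms the price effect (0.22×) by a wide margin: total serving cost grows roughly 29×, which is why the cost decline coincides with record capex rather than shrinking spend.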
- Google: 9.7 trillion tokens/month (May 2024) → 480 trillion (May 2025) → 980 trillion (July 2025) → 1.3 quadrillion (October 2025). A 134× increase in 18 months.
- Google’s AI infrastructure lead stated internally the company must double serving capacity every six months.
- OpenAI: processing over 6 billion tokens per minute on the API as of October 2025, with 4 million developers. API traffic doubled within 48 hours of GPT-5’s launch.
- Google processes 7 billion tokens per minute via direct customer API usage alone by Q3 2025.
Token throughput is the closest available proxy for aggregate inference demand because it directly measures computation performed. The numbers are not projections or forecasts — they are disclosed actuals from the companies with the largest visibility into AI compute consumption. A 134× increase in 18 months, with the rate of growth itself accelerating (doubling from May to July, then continuing), indicates the demand curve is not flattening. The fact that Google must double serving capacity every six months — and is still falling behind — confirms that supply is not catching up to demand despite record-breaking infrastructure expansion.
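The "doubling every six months and still falling behind" claim follows directly from the disclosed figures. The doubling time below is derived from the stated 134×-in-18-months growth, not separately disclosed.

```python
import math

# Implied demand doubling time from the disclosed token figures:
# 9.7T tokens/month (May 2024) -> 1.3 quadrillion (Oct 2025), ~134x
# growth, treated per the text as an 18-month period.
growth = 1.3e15 / 9.7e12       # ~134x
months = 18
doublings = math.log2(growth)  # number of doublings in the period
doubling_time = months / doublings
print(f"{doublings:.1f} doublings, one every {doubling_time:.1f} months")
```

Demand implied by these numbers doubles roughly every 2.5 months. A supplier doubling capacity every six months against demand doubling every 2.5 falls further behind with every cycle, which is exactly what the internal capacity directive describes.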
- ChatGPT: 300M WAU (Dec ’24) → 900M WAU (Dec ’25). 2.5 billion prompts per day.
- Meta AI: 1 billion MAU in Q1 2025 — faster than any AI product in history.
- Gemini app: 650 million MAU by Q3 2025, queries tripling from Q2.
- Developer adoption: Google 7M developers on Gemini (5× YoY), Vertex AI usage up 40×. OpenAI 4M developers. Anthropic Claude Code $2.5B ARR by February 2026.
- Amodei described coding as the “strongest leading indicator of the AI capabilities explosion.”
Inference demand scales with two variables: the number of endpoints generating requests, and the depth of each endpoint’s engagement. Consumer adoption provides breadth (billions of users). Developer adoption provides depth — and crucially, persistence. Every developer building an AI-powered application creates a workload that runs continuously and scales with their own user base, not just with their personal usage. This is the demand surface expanding in two dimensions simultaneously: more humans using AI directly, and more software systems embedding AI as infrastructure. The developer metric is the leading indicator because each developer-integrated application becomes a permanent, scaling inference consumer.
- OpenAI: enterprise seats grew 9× YoY, weekly enterprise messages increased 8×, 3M+ paying business users. API business grew faster than ChatGPT consumer.
- Anthropic: ~80% of revenue from business API. ARR from $1B (Jan 2025) to $19B (Mar 2026). 500+ customers at $1M+ annually. Eight of Fortune 10 are Claude customers.
- Google Cloud: revenue grew 48% Q4 2025, backlog surged 55% QoQ to $240B. AI-model-driven revenue grew ~400% YoY. In Dec 2025 alone, ~350 customers each processed 100B+ tokens.
- AWS: $142B run rate, AI capacity monetized “as fast as we can install it.” Microsoft Azure: 39% growth, demand “continues to exceed supply.” 80% of CIOs plan to adopt Copilot within 12 months.
- Nadella: “only at the beginning phases of AI diffusion.” Jassy: “the lion’s share of that demand is still yet to come.” Amodei: capabilities will “spill over into the real world on a large scale” in 2026.
Enterprise workloads differ from consumer chat in three critical ways: they are persistent (running 24/7/365), high-volume (millions of transactions per day), and growing in complexity (from simple classification to multi-step agent workflows). The convergence of signals is striking — every frontier lab reports enterprise as the fastest-growing segment; every hyperscaler reports demand exceeding supply; every CEO frames the current moment as early-innings. The specific metric of 350 Google Cloud customers each processing 100 billion tokens in a single month demonstrates that enterprise inference is already operating at scale that dwarfs consumer usage per customer. And the leaders closest to the demand unanimously state the S-curve is inflecting, not topping.
- OpenAI: $1B (2023) → $3.7B (2024) → ~$13B+ (2025). Altman has suggested $100B revenue by 2027.
- Anthropic: revenue growing ~10× yearly. Reaching $19B ARR by March 2026 — estimated 1,167% YoY growth. Most revenue from API inference.
- Altman: OpenAI is profitable on inference and would be profitable overall if not investing in training next-generation models.
- Nadella confirmed OpenAI has “beaten every business plan projection” Microsoft has seen.
- Every frontier lab CEO reports the same: demand exceeds supply and the constraint is infrastructure, not willingness to pay. Altman says he could double revenue overnight with double the compute.
Revenue is the hardest signal available. It is not survey data, adoption forecasts, or management optimism — it is customers paying money for inference compute. Anthropic’s case is particularly telling: with ~80% of revenue from API inference, its revenue curve is nearly a pure proxy for enterprise inference demand growth. The fact that these companies are profitable on inference (not subsidizing usage) and still supply-constrained (not demand-constrained) means the revenue figures understate the true demand. If Altman could double revenue with double the compute, then the observable revenue is bounded by supply, not by willingness to pay. Amodei’s analogy of rice on a chessboard captures it: we are on the 40th square, and the shocks from the first 39 combined are a fraction of what’s ahead.
- Jassy described the AI compute market as a barbell: frontier labs on one end, productivity workloads on the other. The middle — enterprise production workloads — “very well may end up being the largest and the most durable.”
- A Fortune 500 company may run 1,000 to 10,000+ custom internal tools, hundreds of databases, tens of thousands of scripts. Each represents accumulated business logic that exists nowhere else.
- Enterprise applications are deeply interconnected: one action cascades through four or more systems. AI embedded in one application accelerates an entire chain — the amplification dynamic.
- Proprietary data creates inference workloads no public model can replicate. Amazon launched Nova Forge for injecting proprietary data into model pretraining.
- Visa’s fraud detection runs on every transaction 24/7. Salesforce agents execute 3 billion workflows/month. Every deployment is a persistent inference workload that runs indefinitely.
The enterprise workload opportunity is structurally different from the consumer and API stories in three ways. First, the scale: millions of private applications, each a candidate for AI integration. Second, the amplification: interconnected systems mean the compute demand is proportional not to the number of applications but to the number of connections between them — a nonlinear relationship in large enterprises. Third, the durability: proprietary data creates unique value that makes each workload irreplaceable and permanent. These are not discretionary features that can be turned off in a downturn. They are operational improvements embedded in the way the enterprise runs — persistent, 24/7 inference consumers that exist for as long as the enterprise itself operates.
- The AI coding agent market grew from ~$500M run-rate (end of 2024) to $5–6B by Q4 2025 — 10× in a single year.
- Claude Code: $0 to $2.5B ARR in nine months. 4% of all public GitHub commits authored by Claude Code, projections exceeding 20% by year-end. Business subscriptions quadrupled since January 2026.
- OpenAI Codex: 2M+ WAU by March 2026, usage up 5× since January. GitHub Copilot: 4.7M paid subscribers. Google Antigravity: 1.5M WAU.
- One Anthropic engineer shipped 300+ pull requests in a month running five parallel Claude Code agents — the output of an entire small team. Meta CFO: output per engineer rose 30%, power users up 80%.
- 84% of developers now use AI tools in workflow, 51% daily. Codex sessions run autonomously for 7+ hours on a single task.
- 28 million developers worldwide, developer compensation 50–70% of enterprise IT budgets. Enterprise IT departments carry 12–24 month backlogs.
The productivity framing dramatically understates what is happening. AI coding tools create three distinct sources of inference demand. First, the direct demand: every line of code written, reviewed, or analyzed by an AI model consumes inference compute continuously, across millions of developers, indefinitely. Second, the recursive demand: AI makes developers faster, faster developers build more AI applications, those applications generate inference workloads, and the cycle accelerates. Third, the strategic demand: when developers use AI on production codebases, the models gain visibility into how enterprises actually operate — the workflows, data schemas, and business logic encoded in millions of lines of code. This is not a feature. It is a channel into the operational core of every enterprise on Earth. One Google principal engineer acknowledged that Claude Code reproduced a year of architectural work in one hour. The revenue trajectory ($500M to $5–6B in one year) confirms the market recognizes this.
- McKinsey global survey (~2,000 organizations): AI adoption rose from 20% (2017) to 88% (2025). Gen AI usage surged from 33% to 79% in two years. Share of companies deploying AI across 3+ functions tripled from 17% to 51% since 2021.
- BCG Build for the Future study (1,250 companies): frontier AI-maturity companies deliver 1.7× revenue growth, 3.6× three-year TSR, 2.7× ROIC. They spend 120% more on AI than laggards.
- BCG: share of AI-driven value from agentic systems expected to nearly double by 2028. 46% of companies already experimenting with agents, 30% allocating >15% of AI budgets to agentic workloads.
- BCG CEO data point analysis of 6,027 earnings calls: “AI Agents,” “Agentic AI,” and “AI Tools” are the fastest-growing keyword clusters. “AI Infrastructure,” “Hyperscaler,” “GPU,” “Data Centers” all in high-growth quadrant.
Every major enterprise technology adoption in the last three decades — ERP, cloud, mobile — followed the same pattern: technology emerges, early adopters demonstrate results, consulting firms systematize the transition for the rest of the market. The consulting ecosystem does not merely measure adoption. It causes it. When McKinsey publishes that 88% of organizations use AI and BCG shows frontier adopters capturing 3.6× the shareholder return, a CEO who reads those findings does not have the option of waiting. The consulting flywheel is now spinning: measure adoption, publish findings, create urgency, staff implementations, measure ROI, publish again. Each rotation converts more enterprise workloads from evaluation into production — and from production into the persistent inference infrastructure that sustains them. The keyword analysis of 6,027 earnings calls confirms this is no longer abstract: CEOs are discussing agents, infrastructure, and GPUs in specific, deployment-oriented terms.