I. From Training to Inference
Training produces the model. Everything examined so far — the scaling laws, the data pipelines, the massive clusters running for months — results in a static artifact: a set of weights frozen at a point in time. Training happens once per model generation. Its costs are enormous but bounded. And those costs, as significant as they are, represent only the first claim on compute.
Inference is the second — and it is the larger one. Inference is what happens every time someone uses the model. A prompt goes in, the weights are loaded into accelerator memory, and the network performs a forward pass — matrix multiplications through every layer — to produce a single token. Then it repeats, token by token, until the response is complete. Every question answered, every line of code generated, every agent action taken is an inference event. If training is building the factory, inference is running the assembly line — and the line never stops.
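In code terms, the serving loop is simple to state, even though each pass through it is enormously expensive. The sketch below is purely illustrative: the toy model stands in for billions of parameters, but the structure it shows, one full forward pass per generated token, is the point.

```python
import random

# Toy stand-in for a real model: returns fake "logits" over a tiny vocabulary.
# Purely illustrative; a real forward pass is billions of matrix multiplies.
def toy_forward(tokens):
    return [random.random() for _ in range(100)]

def sample(logits):
    return max(range(len(logits)), key=lambda i: logits[i])  # greedy pick

def generate(prompt_tokens, max_new_tokens=16, eos_id=0):
    tokens = list(prompt_tokens)          # context starts with the user's prompt
    for _ in range(max_new_tokens):
        logits = toy_forward(tokens)      # one full forward pass per token
        next_token = sample(logits)       # choose the next token
        tokens.append(next_token)         # it joins the context for the next pass
        if next_token == eos_id:          # repeat until the response is complete
            break
    return tokens[len(prompt_tokens):]

print(generate([5, 7, 11]))
```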
Training costs scale with how many models you build, but inference costs scale with how many people use them, how often, and how much computation each request demands.
Training is a capex event. Inference is an ongoing operational load that grows with every new user, every new application, and — as this section will demonstrate — every new generation of model architecture. Inference already accounts for the majority of AI workload on hyperscaler infrastructure, and that ratio is widening.
The Escalating Cost of Intelligence
Before examining how inference is consumed, a foundational dynamic must be established: the compute required to generate a single response is itself increasing — rapidly and structurally — as models get smarter.
Model capability is now being pushed forward by scaling computation at inference time — known as test-time compute. The core insight: a model produces substantially better outputs if it "thinks" before answering, generating intermediate reasoning tokens that explore the problem, evaluate approaches, and self-correct before committing to a response.
Chain-of-thought reasoning is the most visible implementation. Rather than mapping directly from prompt to answer, the model generates an extended internal trace — thousands or tens of thousands of tokens — before producing visible output. OpenAI's o-series, Anthropic's extended thinking, and DeepSeek's R1 all implement this. A question that once generated 300 tokens of output now generates 300 visible tokens plus 5,000 to 50,000 reasoning tokens that consume GPU cycles and memory but never appear to the user. Same useful output. An order of magnitude more compute.
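The arithmetic is easy to verify. Using the token ranges above (illustrative, not measured from any particular provider):

```python
# Compute multiplier from hidden reasoning tokens, using the ranges cited above.
visible = 300                                    # tokens the user actually sees
reasoning_low, reasoning_high = 5_000, 50_000    # hidden chain-of-thought tokens

print((visible + reasoning_low) / visible)    # ~17.7x the generation work
print((visible + reasoning_high) / visible)   # ~167.7x at the high end
```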
More advanced techniques push further. Tree-of-thought strategies branch into multiple parallel reasoning paths, evaluate their promise, prune dead ends, and converge on the strongest. The model generates and discards entire sequences that never appear in any output. Verification loops layer on top: after generating an answer, the model re-reads its own output, checks for errors, and revises — running inference on its own inference.
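A stripped-down sketch shows how quickly the calls multiply. The helper functions below are hypothetical stand-ins for real inference calls; the structure, sample several candidates, keep the best, then verify and revise, is what matters.

```python
# Sketch of best-of-n generation with a verification pass.
# `ask_model` and `score` are hypothetical placeholders; in a real system each
# call below would burn thousands of reasoning tokens of its own.

def ask_model(prompt: str) -> str:
    return f"candidate answer to: {prompt[:30]}..."   # placeholder output

def score(answer: str) -> float:
    return len(answer) % 7 / 7.0                      # placeholder quality score

def solve(question: str, n_candidates: int = 4, max_revisions: int = 2) -> str:
    calls = 0
    # Branch: sample several independent reasoning paths and keep the best one.
    candidates = []
    for _ in range(n_candidates):
        candidates.append(ask_model(question))
        calls += 1
    best = max(candidates, key=score)

    # Verify: re-read the chosen answer and revise it. Inference on inference.
    for _ in range(max_revisions):
        best = ask_model(f"Check this answer for errors and fix them:\n{best}")
        calls += 1

    print(f"{calls} model calls for one visible answer")  # 6 calls with the defaults
    return best

solve("What is the marginal cost of one more token?")
```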
II. The Three Modalities of Inference Demand
A. Chat: The Floor
Chat is the most visible form of inference. A user types a prompt, receives a response. One in, one out, at human speed. This drove the initial adoption wave beginning in late 2022.
Chat's compute profile has already shifted. A circa-2023 interaction produced a few hundred tokens via direct generation. A 2025-era interaction with a reasoning model produces those same few hundred visible tokens plus thousands to tens of thousands of reasoning tokens underneath. The user sees the same concise answer. The cluster behind it performed 10–50x the computation. As providers make reasoning the default mode — which competitive pressure demands, because reasoning models produce better answers — the average compute cost per chat interaction rises even for the same users asking the same questions.
Chat remains the baseline rather than the primary growth driver because of its usage pattern: synchronous, human-speed, a handful of exchanges per session. Human reading speed rate-limits consumption. But hundreds of millions of users generating requests that each carry an inflating compute payload still produce enormous aggregate demand. Chat is the floor of inference consumption. It is a floor that is rising with every model generation.
B. API: The Multiplier
The API layer is where inference demand detaches from human speed. When developers integrate model access into applications, three dynamics change simultaneously.
The volume becomes programmatic. A single application can issue millions of inference calls per day — a coding assistant evaluating every pull request, a customer platform routing every message through classification and generation, a search engine rewriting every query. These are persistent, automated streams running at machine speed.
The context lengths expand. API use cases routinely push into long-context territory: entire codebases at 50,000–200,000 tokens, full contracts, complete earnings transcripts. The compute cost of processing context scales aggressively. A 100,000-token input is not merely 100x a 1,000-token input; the attention computation grows quadratically with sequence length, so the true multiple is higher still.
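A rough cost model makes the point. The constants below are illustrative and the model ignores real-world optimizations (FlashAttention, sparse and sliding-window attention, caching), but it captures why the quadratic term bites at long context:

```python
# Rough prefill-cost model: a linear term (MLP and projections) plus a quadratic
# attention term. Constants are illustrative, not any specific model's numbers.

def prefill_flops(n_tokens, n_params=70e9, n_layers=80, d_model=8192):
    linear = 2 * n_params * n_tokens                   # ~2 FLOPs per parameter per token
    attention = 4 * n_layers * d_model * n_tokens**2   # QK^T and attn*V across all layers
    return linear + attention

short = prefill_flops(1_000)
long = prefill_flops(100_000)
print(f"{long / short:.0f}x")   # well above the naive 100x, because of the n^2 term
```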
The calls chain. A RAG pipeline might reformulate a query, retrieve documents via an embedding model, then generate a final answer — a minimum of two LLM passes for one user-visible interaction. Compound pipelines are now standard architecture.
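A minimal sketch of that chaining pattern, with hypothetical placeholder functions rather than any vendor's actual API, looks like this:

```python
# Minimal sketch of a RAG pipeline: one user request, multiple model passes.
# Every function that calls the model is a separate inference event.
# All names below are hypothetical placeholders.

def llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"

def embed(text: str) -> list[float]:
    return [float(ord(c)) for c in text[:8]]       # toy embedding

def retrieve(query_vec: list[float], k: int = 3) -> list[str]:
    return [f"doc_{i}" for i in range(k)]          # toy retrieval

def answer(user_question: str) -> str:
    # Pass 1: reformulate the question into a retrieval query (LLM call).
    search_query = llm(f"Rewrite as a search query: {user_question}")
    # Embedding-model call: smaller, but still inference on an accelerator.
    docs = retrieve(embed(search_query))
    # Pass 2: generate the final answer grounded in the retrieved documents.
    context = "\n".join(docs)
    return llm(f"Answer using these documents:\n{context}\n\nQuestion: {user_question}")

print(answer("Summarize our Q3 contract obligations."))
```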
Smarter models attract harder tasks. Harder tasks consume more compute. The spiral feeds itself.
Developers select the most capable models because capability is product quality. The most capable models are reasoning models — the most compute-intensive per call. A three-call pipeline using a reasoning model is not 3x a single call; it is 3x a call that is itself 10–50x more expensive than the equivalent call eighteen months ago. And as models become more capable, developers delegate harder problems, demanding deeper reasoning and longer thinking traces.
Every SaaS platform, developer tool, and enterprise workflow adding an "AI feature" adds a new persistent stream of inference demand — a stream whose per-request intensity is escalating with each model generation.
C. Agents: The Step Function
Agents represent a qualitative shift in inference economics. An agent is not a request-response pair. It is an autonomous loop: receive a task, reason about it, take an action (call a tool, write code, search the web, read a file), observe the result, reason again about what to do next. This repeats — dozens or hundreds of times — until the task is complete.
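Structurally, the loop is simple. The sketch below uses toy placeholders for the model and the tools, but it shows why token consumption scales with the number of steps rather than with the length of the user's request:

```python
# Skeleton of an agent loop: reason, act, observe, repeat until done.
# `call_model` and `run_tool` are hypothetical placeholders; a real agent would
# route actions to a code runner, browser, file system, and so on.

def call_model(history: list[str]) -> str:
    step = len(history)
    return "DONE" if step >= 6 else f"ACTION: tool_{step}"   # toy policy

def run_tool(action: str) -> str:
    return f"result of {action}"                             # toy tool output

def run_agent(task: str, max_steps: int = 50) -> list[str]:
    history = [f"TASK: {task}"]                 # accumulated context, grows every step
    for _ in range(max_steps):
        decision = call_model(history)          # one full inference call per step
        history.append(decision)
        if decision == "DONE":
            break
        history.append(run_tool(decision))      # the observation feeds the next call
    return history

print(run_agent("Fix the failing integration test"))
```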
The compute implications are multiplicative across every dimension simultaneously.
Token volume per task. Where a chat exchange might consume 1,000–3,000 visible tokens (10,000–50,000 total with reasoning) and an API pipeline might consume 10,000–200,000, a single agent task routinely consumes hundreds of thousands to over 1,000,000 tokens. A coding agent implementing a feature reads files, writes code, runs tests, interprets errors, revises, and iterates — each step a full inference call with a growing context window accumulating the history of every prior action.
Reasoning at every step. This is where the escalating cost of intelligence collides most forcefully with the agent paradigm. Every step in an agent loop is a decision point: evaluate state, consider options, choose an action. These are exactly the problems where extended reasoning produces the largest quality gains. A 30-step task where each step invokes 10,000 tokens of chain-of-thought reasoning generates 300,000 reasoning tokens alone — before counting input context, tool outputs, or visible responses.
Autonomy at machine speed. Agents eliminate the human-speed bottleneck entirely. A user launches a task and walks away. No reading pause, no thinking time, no context switch between steps. The agent consumes compute continuously at machine speed until termination. Sophisticated workflows spawn multiple agents in parallel — one researching, one coding, one reviewing — each independently consuming inference resources at full throughput.
Compounding context pressure. Agents accumulate state. Each observation appends to the running context. By the midpoint of a complex task, the agent operates near maximum context length on every call. Each step is not just one more inference call — it is an increasingly expensive inference call, as growing context increases both prefill compute and KV-cache memory requirements. The cost curve within a single task is not linear. It accelerates.
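A toy model of a single task shows the acceleration. The token counts are assumptions chosen for illustration, not measurements, but the shape of the curve follows directly from the fact that every step re-processes the accumulated history:

```python
# Illustrative cost curve for a single agent task. Each step appends its
# reasoning trace and tool output to the context, so later steps pay to
# re-process everything that came before.

TOKENS_ADDED_PER_STEP = 12_000   # reasoning plus tool output appended each step (assumed)
INITIAL_CONTEXT = 4_000          # task description, system prompt, file listing (assumed)

def task_input_tokens(n_steps):
    total = 0
    context = INITIAL_CONTEXT
    for _ in range(n_steps):
        total += context                 # every step re-reads the full history
        context += TOKENS_ADDED_PER_STEP
    return total

print(task_input_tokens(10))   # ~580k input tokens processed
print(task_input_tokens(30))   # ~5.3M: 3x the steps, roughly 9x the processing
```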
III. Drivers of the Inflection: Why Demand Compounds
The frontier lab leaders identify several compounding drivers that will push inference demand into a new regime. These are not speculative — they are grounded in current product trajectories and disclosed data.
Agents & Autonomous Workloads
Agentic AI is the single largest inference multiplier on the horizon. A human typing a query generates a handful of API calls. An agent performing a multi-step task — booking travel, debugging code, managing a pipeline — generates orders of magnitude more inference. Amazon CEO Andy Jassy noted companies are "just starting to think about deploying AI agents." Microsoft CEO Satya Nadella said 80% of CIOs plan to adopt Copilot within 12 months. Anthropic's economic index found that three-quarters of businesses using Claude do so for "full task delegation." Each autonomous workflow is a persistent, high-volume inference consumer.
Reasoning & Thinking Models
Thinking models — such as OpenAI's o-series, Gemini 2.5 Pro, and Claude's extended thinking — consume dramatically more tokens per query. Analysis showed Gemini 2.5 Flash uses roughly 17× more tokens per request than its predecessor.
Model Capability Overhang
Nadella introduced the concept of "model overhang" — the gap between what models can do and what users are actually using them for. Altman said models have "already saturated the chat use case" but their capabilities extend far beyond chat. As enterprises close this adoption gap, inference demand unlocks in step-function increments. Amodei stated at Morgan Stanley in March 2026 that what labs see internally "is far more crazy than what the outside world perceives" and that in 2026, those capabilities will "spill over into the real world on a large scale."
Price Declines Drive Demand Expansion (Jevons Paradox)
Google lowered Gemini serving unit costs by 78% in 2025. Inference prices across all providers have fallen dramatically. But rather than reducing total spend, cheaper inference unlocks new use cases and higher volumes. Amodei noted that Anthropic is "just beginning to optimize for inference." Every reduction in per-token cost expands the addressable market and increases total tokens consumed. The frontier labs' revenue is growing despite steep price cuts — the volume effect dominates.
The Enterprise Diffusion Curve Is Early
Despite explosive growth, penetration remains low. Nadella described the current moment as "only at the beginning phases of AI diffusion." Altman said 2025 was the year enterprise growth outpaced consumer for the first time, with enterprise now a "major priority" for 2026. Jassy said "the lion's share of that demand is still yet to come" and predicted the middle phase of enterprise adoption "may end up being the largest and the most durable" part of the AI market. Zuckerberg stated that Meta's demand for compute resources "increased even faster than our supply" despite massive GPU buildouts. The S-curve of enterprise adoption is inflecting, not topping.
IV. Demand Signals: Frontier Model Labs
Inference demand is not approaching saturation. It is accelerating toward a massive inflection. The evidence from the companies with the deepest visibility into this market converges on a single conclusion: demand is outrunning supply at every layer of the stack, and the drivers of future demand are compounding, not linear.
This synthesis draws on five interlocking bodies of evidence, each sourced from frontier lab and hyperscaler primary disclosures: (1) token volume growth at unprecedented scale, (2) user and developer adoption trajectories, (3) enterprise spending signals, (4) revenue acceleration that directly proxies inference consumption, and (5) capital expenditure commitments that reflect private demand signals invisible to public markets.
Token Volumes: The Most Direct Measure of Inference
Token throughput is the closest available proxy for aggregate inference demand. Google's disclosures provide the most granular public time series. At Google I/O in May 2025, Sundar Pichai revealed that Google's products and APIs were processing 480 trillion tokens per month — a 50× increase from 9.7 trillion just twelve months prior. By July 2025, that figure had doubled again to 980 trillion. By October 2025, it crossed 1.3 quadrillion tokens per month.
That is a 134× increase in 18 months.
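Taken at face value, those disclosures imply a compounding rate that is easy to back out (a back-of-the-envelope calculation, nothing more):

```python
import math

# Implied compounding rate of Google's disclosed token volumes
# (9.7T/month to 1.3 quadrillion/month over roughly 18 months, per the figures above).
start, end, months = 9.7e12, 1.3e15, 18

monthly_growth = (end / start) ** (1 / months)             # ~1.31x per month
doubling_months = math.log(2) / math.log(monthly_growth)   # ~2.5 months per doubling

print(f"{monthly_growth:.2f}x per month, doubling every {doubling_months:.1f} months")
```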
Google's AI infrastructure lead, Amin Vahdat, told employees internally that the company must double its serving capacity every six months to keep pace with demand. On the API side specifically, by Q3 2025, Pichai disclosed that first-party models like Gemini process 7 billion tokens per minute via direct customer API usage alone.
At OpenAI, Altman reported processing over 6 billion tokens per minute on the API as of October 2025, with 4 million developers building on the platform. OpenAI's API traffic doubled within 48 hours of GPT-5's launch, pushing the platform against its compute capacity limits. Altman stated unambiguously: "Most of what we're building out at this point is the inference."
Inference Cost Declines: The Jevons Paradox Engine
The Data: A 150× Cost Decline in 30 Months
The cost of inference at GPT-4-class quality has fallen at a rate that exceeds almost any precedent in technology — faster than PC compute, faster than bandwidth during the dotcom era, and faster than cloud storage.
At the frontier tier (best available model), costs have also fallen sharply — GPT-5.2 at $1.75/$14 input/output represents roughly a 75% decline from GPT-4's launch pricing, while delivering dramatically superior reasoning. Anthropic's own pricing reflects the same trajectory: Claude Opus 4.5 costs $5/$25 per million tokens, down from Opus 4's $15/$75 — a 67% reduction generation-over-generation.
What the CEOs Say About Cost Dynamics
The leaders closest to inference economics universally frame cost declines as demand accelerants, not threats to revenue:
On inference, we have typically seen more than 2× price-performance gain for every hardware generation, and more than 10× for every model generation due to software optimizations. … When token prices fall, inference computing prices fall; that means people can consume more. And there will be more apps written.
— Satya Nadella, Microsoft Q2 FY2025 earnings / November 2025 interview
We were able to lower Gemini serving unit costs by 78% over 2025 through model optimizations, efficiency and utilization improvements.
— Sundar Pichai, Alphabet Q4 2025 earnings
Yet despite this 78% cost reduction, Google's token throughput grew 134× in 18 months, Cloud AI revenue grew 400% YoY, and Pichai guided 2026 CapEx to $175–185 billion. The volume effect overwhelmed the price effect completely.
Why This Makes Inference Demand Insatiable
The relationship between inference cost and inference demand is not linear — it is hyperelastic. This is Jevons Paradox at industrial scale:
Price drops unlock new use cases. At $60/MTok, only high-value enterprise queries justified API calls. At $0.40/MTok, background agents running continuously become economical. The addressable market expands by orders of magnitude.
Volume overwhelms price. Google cut serving costs 78% but token volume grew 50× in the same period. OpenAI has slashed API prices by 90%+ since GPT-4, yet revenue grew from $1B to $13B+. Anthropic's inference costs exceeded projections by 23% — because demand grew faster than efficiency gains.
The cost floor is nowhere in sight. Nadella says Microsoft sees 10× software optimization per model generation, with hardware adding another 2× per generation. Amodei says optimization is "just beginning." Google just launched tiered inference pricing (Flex at 50% discount for latency-tolerant workloads) — indicating further segmentation ahead.
Cheaper inference creates compound demand. Each price reduction enables a new class of application (agents, background processing, always-on copilots) that generates 10–100× more tokens per user session than a simple chat query. The demand created by cheaper inference dwarfs the demand it replaces.
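Rough arithmetic, using the per-million-token prices cited above and an assumed (not measured) token budget for a heavy agent task, shows why the price floor matters so much:

```python
# Back-of-the-envelope economics of a long-running agent task at two price points.
# Prices per million tokens are the figures cited above; the token count per task
# is an illustrative assumption for a background agent, not a measured value.

TOKENS_PER_TASK = 1_000_000   # a heavy agent task: many steps, long accumulated context

def cost_per_task(price_per_mtok: float) -> float:
    return price_per_mtok * TOKENS_PER_TASK / 1_000_000

print(cost_per_task(60.00))   # $60.00 per task: viable only for high-value work
print(cost_per_task(0.40))    # $0.40 per task: cheap enough to run continuously
```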
User & Developer Adoption: The Demand Surface Is Expanding
Inference demand scales with the number of users and the depth of their engagement. ChatGPT tripled its weekly active users from 300 million (Dec '24) to 900 million (Dec '25). Users send roughly 2.5 billion prompts per day. Meta AI crossed 1 billion monthly active users in Q1 2025 — faster than any AI product in history. The Gemini app reached 650 million monthly active users by Q3 2025, with queries tripling from Q2.
Developer adoption — the leading indicator for enterprise inference — is surging in parallel. Google reported 7 million developers building with Gemini (5× YoY), with Vertex AI usage up 40×. OpenAI counts 4 million developers. Anthropic's Claude Code hit $2.5 billion in annualized revenue by February 2026. Dario Amodei described coding as the "strongest leading indicator of the AI capabilities explosion." Every developer building an AI-powered application creates a persistent inference workload that scales with their user base.
Enterprise Adoption: The Inference Multiplier
Enterprise and developer usage is the critical vector for inference inflection. Unlike consumer chatbot interactions, enterprise workloads are persistent, high-volume, and growing in complexity. The data from frontier labs and hyperscalers is unambiguous: enterprise adoption is accelerating, not plateauing.
Frontier Lab Enterprise Signals
At OpenAI, enterprise seats grew 9× year-over-year, weekly enterprise messages increased 8×, and the company now counts over 3 million paying business users and 1 million business customers. Usage of Custom GPTs and Projects increased 19× in 2025. Altman stated in late 2025 that the API business grew faster than ChatGPT consumer — a structural shift toward inference-heavy developer and enterprise workloads.
At Anthropic, where roughly 80% of revenue comes from businesses via the API, annualized revenue grew from $1 billion in January 2025 to $19 billion by March 2026 — overwhelmingly driven by enterprise API consumption. The number of customers spending $100K+ annually grew 7× in a year; over 500 now spend $1M+ annually. Eight of the Fortune 10 are Claude customers. Amodei described Anthropic as "the fastest growing software company in history at the scale that it's at."
Hyperscaler Enterprise Signals
Google Cloud revenue grew 48% in Q4 2025 to a $70B+ run rate, with backlog surging 55% quarter-over-quarter to $240 billion. Revenue from products built on Google's generative AI models grew nearly 400% year-over-year in Q4. In December 2025 alone, nearly 350 Google Cloud customers each processed more than 100 billion tokens. Pichai reduced Gemini serving unit costs by 78% in 2025, yet total demand still overwhelmed capacity.
AWS grew 24% in Q4 to $35.6 billion in quarterly revenue ($142B run rate), with Jassy describing AI capacity as being monetized "as fast as we can install it." Trainium chips represent a $10B+ run-rate business with over 100,000 companies using them. Microsoft reported Azure growth of 39%, with CFO Amy Hood noting that customer demand "continues to exceed our supply" despite $37.5 billion in quarterly CapEx. 80% of CIOs surveyed plan to adopt Copilot within 12 months.
V. Revenue Trajectories: Inference Monetization at Scale
Revenue at frontier labs is the most direct monetization proxy for inference consumption. The trajectories are extraordinary.
OpenAI grew from $1B (2023) to $3.7B (2024) to an estimated $13B+ (2025), with Altman suggesting $100B revenue by 2027 on the BG2 Podcast. Anthropic grew revenue roughly 10× per year: zero to $100M (2023), $1B (2024), and $19B ARR by March 2026 — a trajectory Sacra estimates at 1,167% year-over-year growth.
Critically, most of Anthropic's revenue comes from API inference — businesses paying per-token for model outputs. This makes Anthropic's revenue curve nearly a pure proxy for enterprise inference demand growth. The company's 8- and 9-figure enterprise deals tripled in 2025 versus 2024, and average business customer spend grew 5×.
The convergence of evidence is striking in its unanimity. Every frontier lab CEO — Altman, Amodei, Pichai — reports that demand exceeds supply and that the current constraint is infrastructure, not willingness to pay. Every hyperscaler CEO — Nadella, Jassy, Zuckerberg — is doubling capital expenditure based on demand signals they describe as unprecedented.
Altman says he could double revenue overnight with double the compute. Jassy says capacity is monetized "as fast as we can install it." Google must double serving capacity every six months. The drivers ahead — agents, reasoning models, enterprise diffusion, price-driven demand expansion — are compounding, not additive.
Dario Amodei's analogy of rice on a chessboard captures it precisely: we are on roughly the 40th square, and the shocks from the first 39 squares combined are a fraction of what's ahead. Inference demand is not merely growing. It is approaching an inflection that the companies closest to it believe will reshape the economy.
VI. Demand Signals: Enterprise
The BCG CEO Data Point analysis of 6,027 company earnings calls in Q4 2025 provides a striking demand-side signal. Across the entire corporate landscape — not just tech — "AI" is the single most frequently mentioned keyword cluster and is still growing quarter-over-quarter. More telling are the high-growth, emerging clusters.
Industry-Specific Value Levers
There is a reason enterprise adoption of AI is accelerating despite the complexity and cost involved. The reason is not enthusiasm. It is not trend-following. It is the specific, quantifiable economics of operational improvement — the math that business unit owners, plant managers, and P&L leaders run every day.
Enterprises do not operate in abstractions. They operate inside very specific economic machines. Each machine has levers. Those levers have measurable output. And the operators who manage those machines know, with precision, what moving a lever is worth.
Every industry has a physics variable. A small delta produces nonlinear financial outcomes. When AI moves that variable — even slightly — the adoption decision is immediate.
And critically: every one of these deployments is a persistent inference workload. Visa's fraud detection runs on every transaction, 24/7/365. McDonald's AI processes orders during every shift. Klarna's assistant handles millions of conversations monthly. Salesforce's agents execute 3 billion workflows per month. None of this is a one-time computation. It is continuous, scaling inference consumption that persists for as long as the improvement persists — which is to say, indefinitely.
This is why enterprise AI adoption is not a matter of if. It is a matter of how fast. The economic incentive is too direct and too large for operators to ignore.
VII. Enterprise Consulting Signals
There is a category of market participant whose role in this transition is systematically underappreciated: the management consulting industry. Every major enterprise technology adoption in the last three decades — ERP, cloud, mobile — has followed an identical pattern. The technology emerges, early adopters demonstrate results, and then the consulting firms arrive to systematize the transition for the rest of the market. McKinsey, BCG, Deloitte, Accenture, and PwC do not merely measure enterprise adoption. They cause it.
They build the frameworks, staff the transformation offices, publish the benchmarks, and — critically — create the competitive pressure that forces laggards to move. Their involvement is itself a structural signal: when these firms are publishing data showing that 88% of organizations now use AI in at least one business function, that the share using it across five or more functions has grown 5× in four years, and that "future-built" companies are capturing 5.3× the revenue impact and 3.0× the cost reduction of laggards — those findings do not simply describe the market. They reshape it.
McKinsey's global survey of nearly 2,000 organizations shows adoption rising from 20% in 2017 to 88% in 2025, with gen AI usage alone surging from 33% to 79% in just two years — and critically, it is not shallow adoption. The share of companies deploying AI across three or more functions tripled from 17% to 51% since 2021, meaning inference workloads are multiplying within organizations, not just across them.
BCG's Build for the Future study of 1,250 companies provides the ROI proof that sustains the flywheel: companies at the frontier of AI maturity deliver 1.7× the revenue growth, 3.6× the three-year total shareholder return, and 2.7× the return on invested capital — and they are reinvesting the gains, spending 120% more on AI than laggards, creating a compounding advantage that widens with each cycle.
Perhaps most significant for inference demand, BCG finds that the share of AI-driven value from agentic systems is expected to nearly double by 2028, with 46% of companies already experimenting with agents and 30% allocating more than 15% of their AI budgets to agentic workloads. Each agent deployed is not a single inference call — it is a persistent, autonomous workload that consumes tokens continuously.
VIII. The Enterprise Workload Opportunity
Andy Jassy, Amazon's CEO, offered a framing on the company's fourth quarter 2025 earnings call that deserves close attention. He described the current AI compute market as a barbell. On one end sit the frontier AI labs — OpenAI, Anthropic, Google DeepMind, xAI, Meta AI — spending enormous sums on training and running the largest models. On the other end sit the productivity and cost-avoidance workloads that enterprises are already extracting value from today: customer service automation, business process automation, fraud detection, document summarization. These are the use cases that justify the current wave of spending.
But the middle of the barbell, Jassy said, is where the real opportunity lies. That middle consists of all the enterprise production workloads — the millions of custom, internal, and legacy applications that actually run the world's businesses. Jassy was explicit: this middle portion "very well may end up being the largest and the most durable" source of AI compute demand.
— Andy Jassy, Amazon Q4 2025 earnings
This is the single most important demand signal in the entire AI infrastructure thesis, and it is the one that receives the least attention.
The Invisible Majority
To understand why the enterprise workload opportunity is so large, you first have to understand what enterprise applications actually are — and how many of them exist.
When most people think of enterprise software, they think of the names they recognize: Salesforce, SAP, Oracle ERP, Workday, ServiceNow. These are the core commercial platforms — the ERP systems, CRM platforms, and HR suites that form the operational backbone of large organizations. They are important, well-understood, and represent the visible tip of the enterprise software stack. But they account for a remarkably small fraction of the total.