- An LLM is a mathematical function defined by billions of parameters (dials). GPT-5 is widely reported to have over a trillion parameters. Each dial must be precisely set; random settings produce gibberish.
- The model converts tokens into high-dimensional vectors (embeddings), where meaning is encoded as geometry — distances and directions between vectors capture relationships between concepts.
- Every output emerges from a single repeated operation: predict the next token, append it, predict again. Each prediction requires a forward pass through the entire model structure.
- The volume of data is enormous, the number of parameters is enormous, and the computational effort to shape those parameters is, by necessity, enormous.
The argument establishes a necessary relationship, not a contingent one. The capability of an LLM is not a feature that can be added cheaply on top of a simple system. It is an emergent property of the precise configuration of billions of parameters, which can only be achieved through a massive computational process. The richer and more precise the geometric representation of language (the embedding space), the more capable the model — and richer geometry requires more parameters, which requires more compute to tune. This is not a design choice that could have gone differently. It is a consequence of what these systems are.
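The "predict, append, predict again" loop can be sketched in a few lines. The "model" below is a hypothetical hand-written bigram table standing in for a real network; in an actual LLM, each call to the prediction function is a full forward pass through the entire parameter set, but the loop structure is identical.

```python
# Toy sketch of the autoregressive loop: predict the next token, append it,
# predict again. The bigram table is a made-up stand-in for a real model.
BIGRAMS = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
}

def predict_next(tokens):
    """Stand-in for one full forward pass through the model."""
    return BIGRAMS.get(tokens[-1], "<eos>")

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        nxt = predict_next(tokens)   # one forward pass per generated token
        if nxt == "<eos>":
            break
        tokens.append(nxt)           # append, then predict again
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```

The cost structure follows directly from this loop: every generated token pays for a complete traversal of the model.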
- Each training step requires a full forward pass through a network of hundreds of billions of mathematical operations, followed by backpropagation at roughly twice the cost of the forward pass, repeated trillions of times.
- GPT-3 (175B parameters, 300B tokens) required ~3.14 × 10²³ floating-point operations (about 3,640 petaflop/s-days). A person doing one operation per second would need roughly ten quadrillion years. A modern laptop CPU would take on the order of a hundred thousand years.
- Training clusters consist of thousands of GPUs networked together, executing an uninterrupted campaign of computation running 24/7 for weeks or months.
- Training runs cannot be casually paused. GPUs must remain continuously operational and tightly synchronized. Even a handful of GPU failures mid-run can set the process back days or force a restart from a checkpoint.
- GPT-3’s training run consumed millions of dollars in compute alone — and GPT-3 is not even close to the current frontier.
The scale of computation is not an engineering choice — it is dictated by the mathematics. A forward pass through a trillion-parameter model is a fixed cost per training step. Backpropagation roughly triples it. Trillions of steps multiply it. No algorithmic shortcut eliminates the fundamental requirement: you must traverse the entire model, in both directions, trillions of times. GPUs became the defining hardware of the AI era because they can execute hundreds of trillions of operations per second in parallel — yet even thousands of them, working together for months, barely suffice for a single frontier training run. The physical infrastructure (power, cooling, networking, facility) is not overhead. It is the prerequisite.
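The arithmetic behind these figures can be checked with the standard rule of thumb that training costs ~6 FLOPs per parameter per token (2 for the forward pass, 4 for backpropagation). The hardware throughput figures below are illustrative assumptions, not measurements.

```python
# Back-of-envelope check of the GPT-3 training-compute figures, using the
# standard ~6 * params * tokens approximation for total training FLOPs.
params = 175e9        # GPT-3 parameter count
tokens = 300e9        # GPT-3 training tokens
total_flops = 6 * params * tokens
print(f"total: {total_flops:.2e} FLOPs")     # ~3.15e+23, matching the cited order

seconds_per_year = 365 * 24 * 3600
human_years = total_flops / 1 / seconds_per_year       # 1 op per second
laptop_years = total_flops / 1e11 / seconds_per_year   # assume ~100 GFLOP/s CPU
print(f"human at 1 op/s: {human_years:.1e} years")     # ~1e16: quadrillions
print(f"laptop at 100 GFLOP/s: {laptop_years:,.0f} years")  # ~100,000
```

The same formula scales forward: holding tokens per parameter fixed, a 10× larger model trained on 10× more data costs 100× more compute.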
- Pre-transformer approaches faced a fundamental tension: RNNs could see context but processed sequentially (impossible to parallelize); CNNs could parallelize but only captured local patterns.
- The 2017 paper “Attention Is All You Need” introduced the transformer, which computes attention scores between every pair of tokens simultaneously — across the entire sequence, in parallel.
- The attention mechanism converts language processing into massive matrix multiplications, which is precisely what GPU architectures are optimized for.
- Purpose-built accelerators — Google TPUs, AWS Trainium and Inferentia, and a growing class of custom ASICs — have emerged around the same parallel matrix workload, but GPUs remain the dominant, most broadly programmable, and most widely deployed hardware across labs, clouds, and enterprises for both training and inference.
- Every major LLM since — GPT, Claude, Gemini, Llama, Grok — is built on the transformer architecture.
- The architectural insight enabled training models with hundreds of billions of parameters on trillions of tokens in weeks rather than decades.
The transformer’s significance is not that it invented attention or backpropagation — both predate it. Its significance is that it provided an architecture where both operations could be executed with extraordinary parallelism. Previous architectures had a fundamental ceiling: sequential processing meant you could not throw more hardware at the problem and expect proportional speedup. The transformer removed that ceiling. By converting the core computation into parallel matrix operations, it made the problem hardware-solvable — and GPUs were the hardware that solved it. This alignment between the mathematical structure of the transformer and the physical architecture of GPUs is the single most important enabling condition for everything that followed.
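The claim that attention reduces language processing to matrix multiplications can be made concrete. Below is a minimal sketch of scaled dot-product attention from "Attention Is All You Need", with toy shapes; every pair of token positions is scored in a single matrix product, which is exactly the workload GPUs parallelize.

```python
import numpy as np

# Minimal scaled dot-product attention: three matrix multiplications plus a
# row-wise softmax. All token pairs are scored at once, not sequentially.
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq): every pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8            # toy sizes: 4 tokens, 8-dim vectors
Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))
V = rng.standard_normal((seq_len, d_model))
out = attention(Q, K, V)
print(out.shape)                   # (4, 8): one contextualized vector per token
```

An RNN must produce these contextualized vectors one position at a time; here the entire sequence is handled in three matrix products, which is why adding hardware yields near-proportional speedup.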
- OpenAI’s January 2020 paper “Scaling Laws for Neural Language Models” demonstrated with rigorous empirical evidence that performance improves predictably with three variables: parameters, data, and compute.
- Capabilities that did not exist at one scale — summarization, reasoning, code generation, multilingual fluency — emerged at the next. This was discovered empirically.
- The scaling laws paper gave the field an explicit roadmap: if you want a more capable model, spend more on compute.
- Amodei (Feb 2026): “I don’t see issues with scaling laws continuing.” Hassabis (Dec 2025): “The scaling of the current systems, we must push that to the maximum.” Altman (Apr 2024): “We are not near the top of this curve.”
- Training compute has grown at ~4.4× per year since 2010. Each generation of model confirms the relationship still holds.
Scaling laws transform AI from a research gamble into an engineering problem. When you know, with empirical certainty, that more compute produces better results, the decision framework becomes straightforward: invest in compute. The fact that this relationship has survived across multiple labs working independently, multiple model architectures, and the addition of entirely new training paradigms (reinforcement learning) makes it one of the most robust empirical findings in modern technology. It has been tested and confirmed at every scale from millions to trillions of parameters. The CEOs of the three leading frontier labs — OpenAI, Anthropic, Google DeepMind — all state publicly that they see no evidence of the relationship breaking down.
- ChatGPT reached 1 million users in 5 days and 100 million in 2 months — the fastest-growing consumer application in history.
- Every component of ChatGPT had been published openly: the 2017 transformer paper, GPT (2018), GPT-2 (2019), GPT-3 (2020), the scaling laws paper (2020). The technology was hiding in plain sight for five years.
- The entire progression from GPT to GPT-2 to GPT-3 to ChatGPT followed the scaling laws: each increase in scale produced qualitatively better capabilities (summarization, reasoning, code generation, multilingual fluency).
- ChatGPT demonstrated that a transformer-based model, trained at enormous scale, fine-tuned for conversation, and released for public use, could perform tasks across virtually every domain of human knowledge.
ChatGPT’s importance is not technical but economic and strategic. It established two conditions simultaneously: (1) that transformer-based models trained at scale produce capability with genuine commercial value, and (2) that this value can be captured at massive scale in a consumer and enterprise market. Once both conditions hold, the economic logic becomes inescapable for every technology company, sovereign government, and institutional capital allocator. The speed of adoption — faster than any technology product in history — eliminated any remaining doubt about the demand side. The question was no longer whether AI capability would be valued, but how much compute would be needed to supply it.
- Microsoft extended its OpenAI investment past $13 billion and wove the models into Azure, Bing, and Microsoft 365, with continued participation in subsequent OpenAI rounds. Google merged DeepMind with Google Brain in April 2023 and began reorienting the entire company around AI-first infrastructure, from Gemini to custom TPUs. These are capex tied to core revenue engines.
- Amazon committed an initial $8 billion to Anthropic to anchor a frontier-model relationship for AWS, and in April 2026 agreed to invest up to another $25 billion alongside a $100 billion ten-year AWS commitment from Anthropic. Amazon separately anchored OpenAI’s April 2026 $122 billion round with a $50 billion investment — placing Amazon across both frontier labs at unprecedented scale.
- By April 2026, OpenAI had closed a $122B round at an $852B post-money valuation (Amazon $50B, Nvidia $30B, SoftBank $30B). Anthropic raised $30B in February 2026 at a $380B valuation and was fielding mid-April offers implying a valuation near $800B, on an annualized revenue run-rate above $19B. The Stargate Project — $500B, 10 GW — had moved from announcement to partial execution, with a flagship campus in Abilene, Texas and additional U.S. and international sites underway.
- Sovereign wealth funds and national governments entered alongside corporate capital, reflecting the strategic dimension: falling behind in AI is a national security risk.
- The competitive logic is self-reinforcing: scaling laws guarantee returns to compute, competitive pressure guarantees investment, and each investment raises the bar for all competitors.
The defining characteristic of this capital formation is that it is structurally compelled, not discretionary. Scaling laws established an empirical fact (more compute = better models). ChatGPT proved that capability = commercial value. Once both conditions hold, every major participant faces the same inescapable logic: falling behind in AI compute infrastructure is not a missed opportunity but an existential competitive risk. This is true whether the entity is a corporation defending market position or a nation-state securing economic and military advantage. No single actor can rationally choose to stop, because stopping means falling behind. The capital commitments are not bets on a trend. They are self-reinforcing obligations created by competitive necessity.
- Epoch AI data: training compute for frontier models growing at ~4.4× per year since 2010, meaning roughly 1,600× more compute every five years (10,000× in just over six).
- Amodei (Nov 2024, Lex Fridman): “Today’s models cost of order $100 million. Models in training now are closer to $1 billion. In 2025–2026, we’ll get to $5 or $10 billion.”
- Amodei (Feb 2026, Dwarkesh Patel): By 2027, frontier labs will have ambitions to build $100 billion training clusters.
- OpenAI committed to $600 billion in compute spending by 2030, with revenue projections of $280 billion annually to justify it.
- Training run durations growing at ~1.26× per year, meaning the majority of compute scaling comes from larger clusters (more GPUs in parallel), not longer runs.
Each order-of-magnitude jump in training compute is not absorbed by any single lever. Each new GPU generation delivers meaningful per-chip throughput gains, and software efficiency improvements — better kernels, optimized attention implementations, mixed precision, improved parallelism strategies — compound on top. But these gains fall well short of the compute growth rate the frontier demands. The residual has to come from more GPUs and, to a lesser extent, longer runs, and the data shows scaling has been predominantly horizontal: assembling ever-larger clusters rather than running longer. Each generation of frontier model therefore requires multiples more GPUs, deployed in a physically larger installation, with proportional power, cooling, networking, and facility capacity. The cost escalation is not a speculation about what might be needed. It is a description of what Epoch AI has measured, what the lab CEOs have stated publicly under their own names, and what the financial commitments already reflect.
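The decomposition described above can be checked directly from the two Epoch AI growth rates quoted in the bullets: if total training compute grows ~4.4× per year while run durations grow only ~1.26× per year, the remaining factor must come from throughput, i.e. more chips in parallel and faster chips.

```python
# Decomposing the annual compute growth rate into duration vs. throughput,
# using the Epoch AI figures cited above.
total_growth = 4.4       # total training compute, per year
duration_growth = 1.26   # training run length, per year
throughput_growth = total_growth / duration_growth
print(f"required throughput growth: {throughput_growth:.2f}x per year")  # ~3.49x

years = 5
print(f"over {years} years: {total_growth**years:,.0f}x total compute, "
      f"{throughput_growth**years:,.0f}x cluster throughput")
```

Since per-chip gains per generation fall well short of ~3.5× per year, the balance necessarily shows up as larger clusters, with proportionally larger power, cooling, and facility footprints.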
- OpenAI’s o1/o3 reasoning models, Anthropic’s extended thinking, and Google DeepMind’s reasoning models all use RL post-training on top of large-scale pre-training.
- Amodei (Jan 2025, DeepSeek essay): “From 2020–2023, the main thing being scaled was pretrained models. In 2024, reinforcement learning to train models to generate chains of thought has become a new focus of scaling.”
- Sergey Brin (Google I/O, May 2025): DeepMind’s AlphaGo work showed that RL combined with search could achieve what would take “5,000 times as much pre-training to match.” Applied to language models: “We’re just at the tip of the iceberg.”
- A frontier model now requires a large-scale pre-training run AND a large-scale RL post-training run. Both consume enormous compute. Total training compute per model has acquired a second multiplicative dimension.
Before 2024, the primary scaling dimension was pre-training: more data, more parameters, more compute on the forward/backward pass. RL post-training introduces a fundamentally different type of computation — the model generating its own reasoning traces, evaluating their quality, and refining its approach — that runs in addition to pre-training. This is not a more efficient replacement. It is a second, independent axis of compute demand that multiplies the first. Brin’s framing is instructive: if RL can achieve results that would require 5,000× more pre-training, the implication is not that RL is cheap but that the combined value (pre-training + RL) justifies enormous additional investment. Each lab is now scaling along two axes simultaneously.
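The RL loop the labs run at vast scale (generate, score, reinforce) can be illustrated with a deliberately tiny sketch. The two "reasoning strategies" and the hard-coded reward below are hypothetical stand-ins for sampled chains of thought and a verifier; this is plain REINFORCE on a two-action policy, not any lab's actual recipe.

```python
import math
import random

# Tiny REINFORCE sketch of RL post-training: sample an output, score it,
# and shift probability toward higher-scoring behavior.
random.seed(0)
logits = [0.0, 0.0]      # policy over: 0 = "reason step by step", 1 = "guess"
REWARD = [1.0, 0.0]      # pretend verifier: always prefers strategy 0

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

lr = 0.1
for _ in range(500):
    probs = softmax(logits)
    a = random.choices([0, 1], weights=probs)[0]   # sample a rollout
    r = REWARD[a]                                  # score it
    for i in range(2):                             # REINFORCE update:
        # gradient of log p(a) w.r.t. logit i is (1[i == a] - p_i)
        logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])

print(softmax(logits))   # probability mass shifts heavily toward strategy 0
```

Note where the compute goes: every training signal requires generating and scoring fresh rollouts, which at frontier scale means full inference passes through the model itself. That is why this axis adds to, rather than replaces, pre-training compute.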
- Multi-modal training: each new modality (text, images, video, audio, code, scientific data) requires additional compute. Gemini 3 is natively multi-modal. GPT-5 unified reasoning and non-reasoning modes in a single model. The surface area of training is expanding.
- Longer context and memory: models are being trained with context windows up to 1M+ tokens. Compute per training example grows at least linearly with sequence length in the dense layers, and quadratically in the attention layers.
- Synthetic data and self-improvement: labs train models on data generated by other AI models. OpenAI used o1 to generate synthetic data for GPT-5. Google DeepMind’s AlphaEvolve uses AI to discover better algorithms. These recursive loops multiply compute.
- Multiple simultaneous training runs: Epoch AI found that the majority of OpenAI’s $5 billion in 2024 R&D compute went to experimental and unreleased models, not to final training runs of published models.
- The experimental compute — hundreds of smaller runs to find the right architecture and recipe — may exceed the final training run itself.
Each of these five drivers operates independently and compounds with the others. Multi-modal training expands the breadth of what must be learned. Longer context expands the depth of each training example. Synthetic data creates recursive loops where the output of one training run becomes the input to the next. And the distinction between “published” and “experimental” compute is critical: the visible models (GPT-5, Claude 4, Gemini 3) are the tip of an iceberg. The labs are running hundreds of experimental training runs simultaneously, each consuming significant compute, to find the right recipe for the next generation. The total training compute consumed by a frontier lab in a year is far larger than the compute consumed by its published models.
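The long-context driver in particular can be quantified with rough FLOP accounting. Per training example of length L, the dense layers cost on the order of 6·N·L FLOPs (the standard forward-plus-backward approximation) while the attention-score computation grows with L². The model configuration below is hypothetical and the constants are illustrative, not any published model's numbers.

```python
# Rough per-example FLOP accounting vs. context length for a hypothetical
# dense model. Dense layers scale ~linearly in L; attention scales ~L^2.
N = 70e9                     # hypothetical parameter count
n_layer, d_model = 80, 8192  # hypothetical depth and width

def flops_per_example(L):
    dense = 6 * N * L                          # 6ND rule: forward + backward
    attn = 6 * 2 * n_layer * L**2 * d_model    # score + mix terms, both passes
    return dense, attn

for L in (4_096, 131_072, 1_000_000):
    dense, attn = flops_per_example(L)
    print(f"L = {L:>9,}: dense {dense:.1e} FLOPs, attention {attn:.1e} FLOPs")
```

Going from a 4K to a 128K window (32×) multiplies the dense cost by 32 but the attention cost by roughly 1,024; at million-token contexts the quadratic term dominates, which is why long-context training is a compute driver in its own right.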
- Epoch AI Capabilities Index: scores rose from ~103 (early 2023) to over 155 (late 2025) across a standardized battery of reasoning, knowledge, math, coding, and language evaluations.
- GPT-4 (Mar 2023): 90th percentile on the bar exam, strong performance on graduate-level science, reasoning capabilities GPT-3.5 could not approach.
- Release cadence accelerated from years to months. By early 2026: GPT-5, Gemini 3 Pro, Claude 4, o3, Grok — each with measurable improvements.
- Capabilities compound: longer context enables more complex reasoning; better reasoning enables more reliable code generation; better code enables autonomous tool use. Each generation opens use cases that were impossible for the prior one.
- No single lab has maintained a durable lead — each new release is met within months by a comparable release from another lab, confirming the competitive dynamic drives continued scaling.
The skeptical objection to the scaling thesis is that more compute might stop producing better models — that the relationship could plateau. The Epoch AI index directly refutes this by showing a consistent upward trajectory across every measured dimension, at every generation, from every frontier lab. The compounding nature of capabilities is especially significant: improvements in one dimension (context length) unlock improvements in others (reasoning, code generation, tool use), creating a multiplicative effect where the combined capability of a new model far exceeds the sum of its individual improvements. The competitive dynamic ensures this is not one lab’s lucky streak — all labs are climbing the same capability curve independently.
- OpenAI’s stated mission: build AGI that benefits humanity. Anthropic’s founding thesis: ensure increasingly powerful AI remains safe as it approaches and exceeds human capability. Google DeepMind’s leadership: AGI timelines measured in years.
- Amodei (Feb 2026): “Making AI that is smarter than almost all humans at almost all things will require millions of chips, tens of billions of dollars, and is most likely to happen in 2026–2027.”
- Hassabis (Dec 2025): “The scaling of the current systems, we must push that to the maximum, because at the minimum, it will be a key component of the final AGI system. It could be the entirety of the AGI system.”
- Altman (Aug 2025): “You should expect OpenAI to spend trillions of dollars on data center construction in the not very distant future.”
- Today’s frontier models are remarkably capable but not AGI — they can draft legal arguments but cannot practice law, can write code but cannot architect novel systems from vague requirements. The gap between current capability and AGI ensures continued investment.
The argument is deliberately timeline-agnostic. It does not require you to believe AGI will arrive in 2027, or ever. It requires only two observations: (1) the frontier labs are pursuing AGI as their explicit objective, with the capital to act on that pursuit, and (2) the process of pursuing AGI — training larger models with more compute — produces commercially valuable models at every intermediate step. This means compute demand grows regardless of whether the ultimate goal is achieved. If AGI arrives in 2027, compute demand explodes. If AGI remains elusive, the labs still invest in the next generation because each generation produces better models that drive more revenue. The destination matters for civilization. It does not matter for compute demand. The journey is sufficient.
- Altman (Aug 2025): “If we didn’t pay for training, we’d be a very profitable company.” Training is the investment; inference is the business. But inference quality depends on training quality.
- Amodei (Feb 2026): Anthropic revenue growing 10× per year — $0 to $100M (2023), $100M to $1B (2024), $1B to $9–10B (2025). Revenue of this magnitude justifies proportional training investment.
- Altman described OpenAI’s posture as “calculated paranoia” — every competitive release triggers a “code red” that accelerates the next training run.
- The competitive equilibrium: everyone keeps scaling. A lab that stops scaling training falls behind on capability, which means falling behind on revenue, which means falling behind on the ability to fund the next run.
- OpenAI: 3× annual revenue growth. Anthropic: 10×. These trajectories validate the investment thesis at each generation and fund the next escalation.
This is the final synthesis of Arguments 1–11. The self-reinforcing loop operates at three levels simultaneously. Technically: scaling laws guarantee that more compute produces better models. Commercially: better models produce more revenue (validated by 3–10× annual growth at frontier labs). Competitively: each lab’s improvement raises the bar for all others, and falling behind on capability means falling behind on revenue, which means falling behind on the ability to invest. There is no rational exit: stopping is not “being conservative” but accepting decline. The arms race is not driven by optimism or hype. It is driven by the observed economics of a market where capability leadership produces extraordinary revenue, and capability leadership requires ever-larger compute investment.