I. The Prediction Machine: How Large Language Models Work
Modern AI is defined by large language models (LLMs) — mathematical functions that take a sequence of text and predict what comes next, repeating that single act to produce everything from conversations to documents. Before any text enters the model, it’s broken into tokens: sub-word units like “under,” “stand,” and “ing” that allow the model to represent any language, including code and mathematical notation, using a fixed, manageable vocabulary. A token corresponds to roughly three-quarters of a word, so when GPT-4 is said to have a context window of 128,000 tokens, that means it can hold approximately 100,000 words in working memory at once — and every output it produces emerges from one repeated operation: predict the next token, append it, predict again.
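To make the token arithmetic concrete, here is a minimal sketch using OpenAI's open-source tiktoken library and its cl100k_base encoding (the one associated with GPT-4-era models). The exact sub-word splits are a property of the chosen tokenizer, and the printed pieces are illustrative.

```python
# Minimal tokenization sketch using OpenAI's open-source tiktoken library.
# Sub-word splits depend on the encoding; cl100k_base is the GPT-4-era one.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Understanding tokenization in large language models"
token_ids = enc.encode(text)                   # text -> list of integer token ids
pieces = [enc.decode([t]) for t in token_ids]  # each id back to its sub-word piece

print(pieces)  # sub-word pieces; exact splits vary by encoding
print(f"{len(text.split())} words -> {len(token_ids)} tokens "
      f"(~{len(text.split()) / len(token_ids):.2f} words per token)")
```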
Tokenization & Embedding
A model cannot do math on words. It cannot do math on tokens as raw text, either. To reason mathematically about language, the model must first translate each token into something a mathematical function can operate on: a list of numbers. In the field, this list of numbers is called a vector, and the process of converting a token into its vector is called embedding. You can think of each vector as a set of coordinates — not in our familiar three-dimensional space, but in a space with hundreds or even thousands of dimensions. In this high-dimensional space, tokens that are used in similar ways end up near each other. “King” and “queen” are close together. “Paris” and “France” are close together. “Python” the programming language and “Python” the snake are far apart, because the model has learned that they appear in entirely different contexts.
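A toy illustration of that geometry, with invented three-dimensional vectors standing in for the hundreds or thousands of learned dimensions a real model uses; cosine similarity is one standard measure of closeness in embedding space.

```python
# Toy embedding-space sketch. These vectors are invented for illustration;
# real embeddings are learned during training and have far more dimensions.
import numpy as np

embeddings = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.88, 0.82, 0.15]),   # near "king": similar usage
    "paris": np.array([0.10, 0.20, 0.95]),   # far away: different contexts
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Close to 1.0 for vectors pointing the same way; lower for unrelated ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["paris"]))  # much lower
```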
This is worth pausing on. The model does not know what these words mean in the way that you do. What it has done, through exposure to vast amounts of text, is discover a spatial arrangement for every token in its vocabulary such that the distances and directions between them capture the relationships between the concepts they represent. Meaning, in an LLM, is geometry. And the richer and more precise that geometry, the more capable the model.
Parameters & the Prediction Loop
The sequence of operations is this: raw text enters the model and is broken into tokens; each token is converted into a high-dimensional vector; then the model's billions of internal parameters go to work, transforming those vectors layer by layer until the final output is a probability distribution over which token should come next. The model selects a token from that distribution (often simply the most likely one), appends it, and the whole process repeats. This is how an LLM generates a sentence, a paragraph, or an entire essay.
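In schematic form, the loop looks like the sketch below. The `model` function is a hypothetical placeholder for the real network, and greedy selection stands in for the sampling strategies production systems actually use.

```python
# The next-token prediction loop in schematic form. `model` is a hypothetical
# stand-in: a real LLM computes this distribution with billions of parameters.
import numpy as np

VOCAB_SIZE = 50_000

def model(token_ids: list[int]) -> np.ndarray:
    """Placeholder returning a probability distribution over the vocabulary."""
    logits = np.random.default_rng(sum(token_ids)).normal(size=VOCAB_SIZE)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                     # softmax -> probabilities

def generate(prompt_ids: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = model(tokens)                  # one forward pass over the context
        next_id = int(np.argmax(probs))        # greedy pick; real systems often sample
        tokens.append(next_id)                 # append, then predict again
    return tokens

print(generate([101, 102, 103], max_new_tokens=5))
```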
Which brings us to the central question: what are those billions of internal parameters, and how do they get set? Think of each parameter as a single dial. A large language model has billions of these dials — GPT-5 is widely reported to have over a trillion. Each dial is set to a precise value, and the specific configuration of all of those dials, collectively, is what gives the model its capability. A model with randomly set dials produces gibberish. A model with correctly set dials produces what we’ve all witnessed over the past three years.
If the task is to internalize the structure of human language at this level — across domains, styles, and contexts — the scale required is immense. The volume of data is enormous. The number of parameters is enormous. And the computational effort required to shape those parameters into something useful is, by necessity, enormous. Every LLM you’ve ever interacted with is the product of billions of dollars of raw compute — and every more capable model that follows will demand more still.
II. The Compute Requirement: How Models Learn
With a shared understanding of what an LLM is — a massive mathematical function, defined by billions of precisely set parameters, that converts tokens into vectors and transforms those vectors into predictions — the natural question becomes: how do those parameters get set?
How do you go from a trillion randomly configured dials producing gibberish to a system that can draft legal briefs, debug code, and explain quantum mechanics?
The answer is training, and the core idea is surprisingly simple. You show the model a piece of text — say, the first half of a sentence from a book — and ask it to predict the next token. It produces a guess. You compare that guess to what the actual next token was. Then you nudge the dials, ever so slightly, in the direction that would have made the guess less wrong. And you repeat this process. Not thousands of times. Not millions of times. Trillions of times.
Forward & Backward Passes
Each repetition of this cycle is called a training step. In each step, the model sees real text, makes a prediction, gets a measured signal for how far off it was, and has its parameters adjusted accordingly. The adjustment to any individual dial in any single step is vanishingly small — a tiny fraction of a fraction. But compounded across trillions of steps, these micro-adjustments accumulate into something remarkable: the dials gradually settle into a configuration that captures the deep structure of human language and, by extension, the knowledge embedded within it.
Consider what a single training step actually involves. The model must take in a sequence of tokens, convert them into vectors, and pass those vectors through every layer of its internal structure (a structure defined by hundreds of billions or even trillions of parameters) to produce a prediction. That is one forward pass: a single traversal of the entire model. Then, having measured the error, the system must trace back through every layer, calculating how each parameter contributed to the mistake and computing the appropriate adjustment for each one. That is a second traversal, the backward pass. So each training step requires, at minimum, two full passes through a structure containing hundreds of billions of mathematical operations. And this happens trillions of times.
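Here is that cycle in miniature, sketched in PyTorch with a deliberately tiny stand-in model; the forward pass, backward pass, and parameter update are structurally the same operations a frontier run executes across hundreds of billions of parameters.

```python
# One training step in miniature (PyTorch). The model is a tiny stand-in;
# frontier runs apply the same forward/backward/update cycle at vast scale.
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64
model = nn.Sequential(
    nn.Embedding(VOCAB, DIM),   # token ids -> vectors
    nn.Linear(DIM, VOCAB),      # vectors -> next-token scores
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

batch = torch.randint(0, VOCAB, (8, 32))       # a batch of token sequences
inputs, targets = batch[:, :-1], batch[:, 1:]  # predict each next token

logits = model(inputs)                         # forward pass
loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()    # backward pass: how much each parameter contributed to the error
optimizer.step()   # nudge every dial, ever so slightly, in the less-wrong direction
print(f"loss after one step: {loss.item():.3f}")
```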
Training at Scale: GPUs, Clusters, and the Cost of Intelligence
GPT-3, released in 2020, had 175 billion parameters and was trained on roughly 300 billion tokens of text, requiring approximately 3.6 × 10²³ floating-point operations — a quantity so vast that a person doing one operation per second would need over eleven quadrillion years to finish, and even a modern laptop CPU would take over a hundred thousand years. This is why GPUs, originally designed for video game graphics and capable of a hundred trillion operations per second, became the defining hardware of the AI era — though even a single high-end GPU would need over a century to train GPT-3 alone.
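Those timescales follow from simple division. A quick back-of-envelope check, using round-number throughput assumptions rather than measured figures:

```python
# Back-of-envelope check of the timescales above. Throughputs are rough,
# round-number assumptions: a person at 1 op/s, a laptop CPU at ~100 GFLOP/s,
# and a GPT-3-era high-end GPU at ~100 TFLOP/s.
TOTAL_FLOPS = 3.6e23          # approximate GPT-3 training compute
SECONDS_PER_YEAR = 3.15e7

for name, ops_per_sec in [("person", 1.0),
                          ("laptop CPU", 1e11),
                          ("single GPU", 1e14)]:
    years = TOTAL_FLOPS / ops_per_sec / SECONDS_PER_YEAR
    print(f"{name:>10}: {years:,.0f} years")

# person    : ~1.1e16 years (over eleven quadrillion)
# laptop CPU: ~114,000 years
# single GPU: ~114 years
```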
The solution is parallelism: thousands of GPUs networked together in a data center, forming what the industry calls a training cluster, which runs around the clock for weeks or months as the model processes its trillions of tokens. These training runs cannot be casually paused; the GPUs must remain continuously operational, tightly synchronized, and fed with data and power, since even a modest number of GPU failures mid-run can set the process back days or force a restart from an earlier checkpoint.
GPT-3’s training run consumed millions of dollars in compute alone, yet GPT-3 is not even close to the current frontier — it was, in effect, the proof of concept. The models that followed, including GPT-5, Claude, Gemini, Grok, and Llama, are widely understood to be an order of magnitude larger or more, with training data measured in trillions of tokens, parameter counts in the hundreds of billions to trillions, and compute requirements that strain the capacity of the largest data centers on Earth.
This is the landscape in which the industry now operates: models of extraordinary scale, trained on essentially all of human written output, requiring compute budgets that only a handful of organizations on the planet can finance — and as subsequent sections will demonstrate, the economics of this process are not stabilizing but accelerating.
III. The Long Road: Language AI Before the Transformer
The challenge that has occupied language AI researchers for decades is fundamentally one of representation: how to convert human language — messy, ambiguous, context-dependent — into numbers a computer can operate on, with every major era offering a different answer, each more powerful and each requiring substantially more compute.
The earliest approaches were crude but practical: bag-of-words models, widely used from the early 2000s, represented text by simply counting word frequencies, making them useful for search engines and spam filters but blind to word order and meaning.
A significant leap came in 2013 with Word2Vec, which learned to represent words as points in a continuous mathematical space by training a neural network to predict neighboring words — discovering on its own that “king” and “queen” should be close together, that “Paris” and “France” should share a relationship similar to “Berlin” and “Germany” — though it assigned each word a single fixed vector regardless of context.
Recurrent Neural Networks and Long Short-Term Memory networks attempted to solve this by processing language sequentially, carrying a running summary forward so that a word's representation could finally be influenced by those before it, but that same sequential processing meant they could not be parallelized across a sequence and were painfully slow to train at scale, while context still degraded over long distances.
Convolutional Neural Networks, borrowed from image recognition, offered speed by examining multiple words simultaneously through fixed-size windows, but could only capture local patterns, leaving relationships between distant words out of reach.
By the mid-2010s, the field had arrived at a fundamental tension: models that understood language well were too slow to train on enough data, and models fast enough to train at scale could not understand language well enough.
That tension was resolved in 2017 with the introduction of a new architecture, but what makes the story remarkable is that it took five years of quietly compounding progress — researchers building ever-larger models, producing results that impressed specialists but barely registered with the broader world — before the true magnitude of the paradigm shift became undeniable in late November 2022.
IV. The Breakthrough: ChatGPT and the Transformer Revolution
On November 30, 2022, OpenAI released a free research preview of ChatGPT, which reached one million users within five days and a hundred million within two months — the fastest-growing consumer application in history — as people who had spent careers dismissing AI found themselves in extended, coherent conversations with a machine that could write essays, debug code, explain legal concepts, and compose poetry with a fluency that felt, for the first time, genuinely human.
But what most of those millions did not know was that everything making ChatGPT possible had been published five years earlier, in a 2017 Google paper titled “Attention Is All You Need,” which introduced the transformer architecture as a technical contribution to machine translation that made no headlines and reached no mainstream audience, yet would become the foundation upon which every major language model — GPT, Claude, Gemini, Llama, and ChatGPT itself — is built.
The Attention Mechanism
The transformer’s breakthrough lay in resolving the fundamental tension identified in the prior section — that models capable of seeing full context were too slow to train at scale, while models fast enough to scale could not see full context — through a single mechanism called attention.
The intuition is straightforward: when reading “The animal didn’t cross the street because it was too tired,” a human effortlessly understands that “it” refers to “the animal” by attending to the relevant earlier word, and the attention mechanism does exactly this mathematically, computing a score for every token against every other token to determine how much each should influence the other’s representation — simultaneously, across the entire sequence, and across multiple independent “heads” that each learn to track different types of relationships, whether grammatical references, semantic similarity, or positional proximity.
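The core computation fits in a few lines of NumPy. This sketch shows single-head scaled dot-product attention; real implementations add learned query/key/value projection matrices, many parallel heads, and masking.

```python
# Scaled dot-product attention: every token is scored against every other
# token, and each output vector is a score-weighted blend of all the values.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (seq, seq): all pairs scored simultaneously
    weights = softmax(scores)       # how much each token attends to each other
    return weights @ V              # blend value vectors by attention weight

seq_len, d_model = 10, 16           # e.g. the sentence about the tired animal
x = np.random.default_rng(0).normal(size=(seq_len, d_model))
out = attention(x, x, x)            # self-attention: Q, K, V from the same tokens
print(out.shape)                    # (10, 16): one context-aware vector per token
```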
Transformer Architecture & Parallelism
The architectural insight that changed everything is captured in a single word: simultaneously. Unlike RNNs, which processed tokens one at a time, the transformer processes all tokens in parallel, computing every attention score between every pair of tokens at once — transforming the problem from a sequential chain into a massive matrix multiplication, which is precisely what GPUs were designed to do, making the alignment between the transformer’s computational structure and GPU architecture not a coincidence but the reason this particular breakthrough scaled.
With attention as its core, the transformer organizes its work in layers: each layer takes the full set of token vectors, applies attention to let every token gather information from every other, then passes the results through a feed-forward network that refines the representations further, with the output of one layer becoming the input to the next. A large transformer might have nearly a hundred such layers, each building more abstract and nuanced representations on top of the last, so that by the final layer the token vectors have been transformed from raw input embeddings into rich, context-saturated representations encoding meaning, syntax, relationships, and domain knowledge — all learned from data.
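A schematic version of one such layer, sketched with PyTorch's built-in attention module; the dimensions and layer count are toys, and real architectures differ in details such as normalization placement.

```python
# One transformer layer, schematically: attention lets every token gather
# information from every other, then a feed-forward network refines each
# token's representation. Frontier models stack ~100 far wider layers.
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)   # every token attends to every token
        x = self.norm1(x + attn_out)       # residual connection + normalization
        x = self.norm2(x + self.ffn(x))    # per-token feed-forward refinement
        return x

tokens = torch.randn(1, 10, 128)           # (batch, sequence, dimensions)
stack = nn.Sequential(*[TransformerLayer() for _ in range(4)])
print(stack(tokens).shape)                 # same shape, richer representations
```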
The Scaling Laws Roadmap
The transformer’s contribution was not inventing the forward pass or backpropagation — both date back decades — but providing an architecture where both operations could be executed with extraordinary parallelism, making it possible for the first time to train models with hundreds of billions of parameters on trillions of tokens in weeks rather than decades.
What is remarkable, in retrospect, is that the entire path from the 2017 paper to ChatGPT was documented in public: GPT in 2018 demonstrated that a single pre-trained transformer could be fine-tuned for a wide range of tasks; GPT-2 in 2019 made headlines when OpenAI initially withheld the full model over misuse concerns; and GPT-3 in May 2020, with 175 billion parameters, delivered few-shot learning, code generation, and coherent long-form text.
But arguably the most consequential paper in the entire sequence arrived in January 2020, when OpenAI published “Scaling Laws for Neural Language Models,” demonstrating with rigorous empirical evidence that model performance improves predictably and smoothly as you increase three variables — parameter count, training data, and compute budget — giving the field an explicit roadmap: if you want a more capable model, spend more on compute. Every step was published openly, and the scaling laws paper alone should have been a fire alarm, yet for most of the world it took ChatGPT’s hundred-million-user debut to make the implications undeniable.
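The paper's headline result can be written down in one line: held-out loss falls as a smooth power law in model size, with analogous laws for data and compute. The sketch below uses the paper's approximate fitted constants, which should be read as illustrative rather than exact.

```python
# The Kaplan et al. (2020) scaling law for model size, L(N) = (N_c / N)**alpha.
# Constants are the paper's approximate fitted values; treat as illustrative.
N_C = 8.8e13       # fitted constant (in parameters)
ALPHA_N = 0.076    # fitted exponent

def predicted_loss(n_params: float) -> float:
    """Predicted test loss from parameter count alone (data/compute ample)."""
    return (N_C / n_params) ** ALPHA_N

for n in [1e8, 1e9, 1e10, 1.75e11]:        # 100M parameters up to GPT-3 scale
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
# Every 10x in parameters shaves a predictable slice off the loss: the roadmap.
```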
And this is exactly what happened in the five years between the paper and ChatGPT. Researchers at OpenAI, Google, and others began building progressively larger transformer-based models — GPT, GPT-2, GPT-3, BERT, T5 — each one scaling up the parameter count, the training data, and the compute budget. With each increase in scale, the models got meaningfully better. Not just incrementally better. Qualitatively better. Capabilities that did not exist at one scale — summarization, reasoning, code generation, multilingual fluency — would emerge at the next. This was not obvious in advance. It was discovered empirically, and it pointed toward a powerful conclusion: the more compute you invest in a transformer, the more capable it becomes, with no clear ceiling in sight.
ChatGPT was the moment this became visible to everyone. It was not a new scientific breakthrough. It was the proof of concept — a transformer-based model, trained at enormous scale on an enormous training cluster, fine-tuned to converse naturally with humans, and released for anyone to use. In a single product, it demonstrated what the transformer architecture, the attention mechanism, the GPU-driven training infrastructure, and five years of relentless scaling had collectively produced. The technology had been hiding in plain sight. ChatGPT simply made it impossible to look away.
V. The Arms Race: Capital, Competition, and the Logic of Escalation
The Capital Reallocation
The response to ChatGPT constituted the largest and most rapid reallocation of capital in the history of the technology industry. Before November 2022, the frontier AI landscape was remarkably contained: Google had acquired DeepMind in January 2014 for a reported $500+ million, Microsoft had placed a $1 billion investment in OpenAI in 2019, Meta had built a major research division under Yann LeCun, and Anthropic had been founded in 2021 by former OpenAI researchers with a focus on AI safety. The relationships were established and the work was underway, but it was still fundamentally a research endeavor. ChatGPT changed the calculus overnight, making clear that transformer-based language models were not a research curiosity but potentially the most consequential computing platform since the internet.
What followed was not a burst of enthusiasm that might fade — it was the formation of a durable industrial structure with self-reinforcing economic logic. Microsoft extended its OpenAI investment past $13 billion and wove the models into Azure, Bing, and Microsoft 365; Amazon committed an initial $8 billion to Anthropic to anchor a frontier-model relationship for AWS, then in April 2026 agreed to invest up to another $25 billion alongside a $100 billion ten-year AWS commitment from Anthropic; Google merged DeepMind with Google Brain in April 2023 and began reorienting the entire company around AI-first infrastructure, from Gemini to custom TPUs. These were not discretionary research budgets — they were capital expenditures tied to core revenue engines, the kind of spending that, once committed, creates its own momentum through competitive necessity.
By April 2026, OpenAI had closed a $122 billion round at an $852 billion post-money valuation, anchored by Amazon ($50B), Nvidia ($30B), and SoftBank ($30B), with continued participation from Microsoft and, for the first time, retail investors via bank channels. Anthropic had raised $30 billion in February 2026 at a $380 billion valuation and was, by mid-April, fielding secondary and primary offers that implied a valuation near $800 billion, on an annualized revenue run-rate that had reportedly climbed above $19 billion, driven almost entirely by enterprise adoption of Claude. The Stargate Project — a $500 billion, 10-gigawatt infrastructure initiative announced at the White House in January 2025 with SoftBank, OpenAI, Oracle, and MGX — had moved from announcement to partial execution, with a flagship campus operating in Abilene, Texas, additional U.S. sites under construction in New Mexico, Ohio, and Milam County, and international buildouts underway in the UAE and Argentina. The threshold for frontier competition had shifted from billions to hundreds of billions, with sovereign wealth funds (GIC, the Qatar Investment Authority, MGX) and national governments entering alongside corporate capital.
The Self-Reinforcing Dynamic
The critical insight for the thesis is that this trajectory is not driven by hype but by structural economic forces that make continued escalation the rational choice for every participant. The scaling laws established an empirical fact: more compute predictably yields more capable models. ChatGPT then proved that capability converts directly into commercial and strategic value. Once both conditions hold simultaneously, every major technology company, every sovereign government, and every pool of institutional capital faces the same inescapable logic — falling behind in AI compute infrastructure is not a missed opportunity but an existential competitive risk.
This is the defining characteristic of an industry forming around training rather than a speculative cycle running its course: the capital commitments are not discretionary but structurally compelled, each participant’s investment raising the floor that every other participant must meet, with no mechanism by which any single actor can rationally choose to stop.
It is this self-reinforcing dynamic — scaling laws guaranteeing returns to compute, competitive pressure guaranteeing investment in compute, and investment in compute raising the bar for all competitors — that explains why spending has grown from millions to billions to hundreds of billions in five years, and why the trajectory points toward trillions.
VI. The Scaling Frontier: AGI, Superintelligence, and the Demand Trajectory
Every frontier lab, every hyperscaler partnership, every billion-dollar investment we have just described is oriented toward the same destination — even if they describe it in slightly different terms. The destination is artificial general intelligence, or AGI, and beyond it, what researchers call superintelligence. Understanding what these terms mean, and more importantly what their pursuit requires, is essential to understanding why compute demand is not a cycle. It is a trajectory.
Artificial general intelligence refers to an AI system that can perform any intellectual task that a human can, across any domain, without being specifically trained for that domain. Today’s frontier models are remarkably capable, but they are not AGI. They can draft legal arguments but cannot independently practice law. They can write code but cannot architect a novel software system from a vague business requirement the way a senior engineer can. AGI would close these gaps — not by memorizing more text, but by developing the kind of flexible, transferable reasoning that allows a single human mind to navigate law, medicine, engineering, and conversation without being retrained for each one.
Superintelligence goes further. It refers to an AI system that exceeds the best human performance across essentially all cognitive domains — scientific research, strategic reasoning, creative synthesis, mathematical proof. This is not science fiction to the people building these systems. OpenAI’s stated mission is to build AGI that benefits humanity. Anthropic’s founding thesis centers on ensuring that increasingly powerful AI systems remain safe and controllable. Google DeepMind’s leadership has spoken publicly about AGI timelines measured in years, not decades. These organizations are not hedging. They are building toward these goals with the full weight of their capital, talent, and compute.
Whether AGI or superintelligence is achieved in five years, fifteen years, or at all is a matter of genuine debate among researchers. But here is the point that matters for our analysis: the debate is irrelevant to the demand for compute. The frontier labs do not need to achieve AGI for their compute requirements to continue growing. They need only to continue doing what they have been doing — training larger models on more data with more compute — because every time they do, the models get meaningfully more capable. The scaling laws are not a theory. They are an empirical observation that has held consistently for years: invest more compute, get more capability. As long as that relationship holds, the incentive to scale is absolute.
The Empirical Record
This is not a speculative inference. It is visible in the data. The first chart below, compiled by Epoch AI, plots the training compute — measured in floating-point operations — used by notable AI models from 2010 through 2025. The vertical axis is logarithmic, meaning each gridline represents a tenfold increase. The trend line shows that the compute used to train frontier models has been growing at approximately 4.4 times per year.
Consider what that means in physical terms. Each order-of-magnitude increase in compute is absorbed partly by hardware improvements — each new GPU generation delivers meaningful per-chip throughput gains — and partly by algorithmic and software efficiency gains, including better kernels, optimized attention implementations, mixed precision, and improved parallelism strategies. These gains are substantial, but they fall well short of the compute growth rate the frontier demands. The residual has to come from more GPUs and, to a lesser extent, longer runs — together with the power, cooling, networking, and data center capacity to support them. The economics of that residual, and why scaling has been predominantly horizontal rather than temporal, are the subject of Thesis III.
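A back-of-envelope decomposition makes the residual concrete. The hardware and software gain rates below are stated assumptions for illustration, not measurements:

```python
# Rough decomposition of the ~4.4x/year growth in frontier training compute.
# HARDWARE_GAIN and EFFICIENCY_GAIN are illustrative assumptions, not data.
TOTAL_GROWTH = 4.4       # effective training compute growth per year (Epoch AI)
HARDWARE_GAIN = 1.35     # assumed per-chip throughput gain per year
EFFICIENCY_GAIN = 1.25   # assumed algorithmic/software gain per year

residual = TOTAL_GROWTH / (HARDWARE_GAIN * EFFICIENCY_GAIN)
print(f"implied growth in chip count x run length: ~{residual:.1f}x per year")

years = 5
print(f"compounded over {years} years: ~{TOTAL_GROWTH ** years:,.0f}x total")
# Under these assumptions, roughly 2.6x per year must come from more GPUs and
# longer runs -- the physical buildout examined in Thesis III.
```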
The implication is straightforward. The frontier labs have a clearly defined objective — build increasingly capable AI systems in pursuit of AGI and ultimately superintelligence. They have an empirically validated method for making progress toward that objective — scale up compute. They have the capital to execute at the required scale. And they have been executing, consistently, for years.
In a February 2026 conversation with Dwarkesh Patel, Anthropic CEO Dario Amodei extended the projection further: by 2027, he expects frontier labs to be pursuing $100 billion training clusters. He was not speculating. He was describing the planning assumptions his own company and its peers are operating under.
Reinforcement Learning as a Second Scaling Axis
The single most important scientific fact driving training demand is that scaling laws continue to hold. More compute, applied to more data with the right techniques, produces measurably better models.
An important evolution occurred in 2024–2025. The original scaling paradigm — making models larger and training them on more data (pre-training) — was augmented by a second axis: reinforcement learning (RL) applied to chain-of-thought reasoning. This is the paradigm behind OpenAI’s o1/o3 reasoning models and similar work at Anthropic and Google DeepMind.
The significance for training demand is profound. Reinforcement learning does not replace pre-training. It is added on top of it. A frontier model now requires a large-scale pre-training run and a large-scale RL post-training run. Both consume enormous compute. The total training compute per model has not merely continued to grow — it has acquired a second multiplicative dimension.
Structural Drivers of Continued Scaling
Five specific factors ensure that training demand will continue to grow exponentially, not level off:
1. Multi-modal Training
Each new modality — text, images, video, audio, code, scientific data — requires additional training compute. Gemini 3 is natively multi-modal. GPT-5 unified reasoning and non-reasoning capabilities. The surface area of training is expanding, not contracting.
2. Reinforcement Learning at Scale
RL post-training is a second multiplicative dimension of compute demand. It does not replace pre-training; it compounds it.
3. Longer Context and Memory
Models are being trained with ever-longer context windows (up to 1M+ tokens). Long-context training is disproportionately expensive: the attention computation grows with the square of the sequence length, so each long-context training example demands far more compute (see the sketch after this list).
4. Synthetic Data and Self-Improvement
Labs are increasingly training models on data generated by other AI models. OpenAI used o1 to generate synthetic data for GPT-5 training. Google DeepMind’s AlphaEvolve uses AI to discover better algorithms. These recursive training loops multiply compute demand because the output of one training run becomes the input to the next.
5. Multiple Simultaneous Training Runs
The frontier labs are not running a single training run at a time. Epoch AI’s analysis showed that OpenAI spent the majority of its $5 billion in 2024 R&D compute on experiments and unreleased models, not on the final training runs of published models. The experimental compute may exceed the final training run itself.
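The context-length factor (3) above deserves one extra step of arithmetic. This sketch shows why long-context training examples are disproportionately expensive; constant factors are omitted.

```python
# Why longer context is expensive (factor 3 above): self-attention compares
# every token with every other, so its cost grows with the square of context
# length, while feed-forward cost grows only linearly. Constants omitted.
def relative_cost(context_len: int, base_len: int = 8_000) -> dict[str, float]:
    """Cost of one training example relative to a base context length."""
    return {
        "attention (quadratic)": (context_len / base_len) ** 2,
        "feed-forward (linear)": context_len / base_len,
    }

for ctx in [8_000, 128_000, 1_000_000]:
    print(f"{ctx:>9} tokens: {relative_cost(ctx)}")
# At 1M tokens, the attention term is ~15,625x the 8K baseline per example.
```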
What the Labs Are Telling Us
The CEOs of the frontier labs are staking their reputations, their companies, and billions of dollars of investor capital on the proposition that training demand will continue to scale, and their public statements are not hedged.
VII. Milestones: The Evidence That Scaling Delivers
The capital has been committed. The training clusters have been built. The scaling laws predict that more compute produces more capability. But predictions are not results. The question an investor must ask is: has the scaling actually delivered? The answer, documented in public benchmarks, published evaluations, and the observable behavior of the models themselves, is unambiguously yes. The progress since ChatGPT has not been incremental. It has been a sustained, measurable, and accelerating expansion of what AI systems can do.
The cadence alone tells a story. In March 2023, OpenAI released GPT-4, which represented a generational leap — scoring in the 90th percentile on the bar exam, demonstrating strong performance on graduate-level science questions, and exhibiting reasoning capabilities that GPT-3.5 could not approach. Within months, Google responded with Gemini, Anthropic released Claude 2, and Meta published Llama 2 as open source. By mid-2024, every frontier lab had released at least one major model update. Then the pace accelerated further — OpenAI introduced the o1 family, Google released Gemini 2.5 Pro, Anthropic shipped Claude 3.5 and Claude 4. By early 2026, GPT-5 and Gemini 3 Pro had arrived. The release cadence across the industry is now measured in months, not years, and each release brings measurable capability gains.
The Capability Staircase
The Epoch Capabilities Index, compiled by the independent research organization Epoch AI, captures this trajectory in a single chart. The index scores models across a standardized battery of evaluations — reasoning, knowledge, mathematics, coding, and language comprehension — and plots them by release date. What the chart reveals is a staircase: a consistent, upward march in aggregate capability from a score of roughly 103 in early 2023 to over 155 by late 2025. Every frontier lab is climbing. No single organization has maintained a durable lead — each new release from one lab is met within months by a comparable or superior release from another.
But aggregate scores, while useful, can obscure the specific dimensions along which these models have improved. The capability gains are not uniform — certain abilities have expanded dramatically while others have improved more gradually. The following table summarizes the most significant dimensions of improvement across frontier models since GPT-4’s release.
These improvements are not independent of one another. Longer context windows enable more complex reasoning. Better reasoning enables more reliable code generation. More reliable code generation enables autonomous tool use. The capabilities compound, and each generation of model opens use cases that were not merely difficult for the prior generation but genuinely impossible.
The Synthesis
Training demand is insatiable because the economics are self-reinforcing. Better models generate more revenue. More revenue justifies larger training runs. Larger training runs produce better models. The scaling laws that govern this loop have not broken despite repeated testing across multiple labs, multiple model generations, and multiple paradigms.
The competitive dynamics ensure that no single lab can slow down without ceding capability leadership. The revenue trajectories — 3× annual growth at OpenAI, 10× at Anthropic — validate the investment thesis at each generation. And the structural drivers (multi-modal training, RL scaling, longer context, synthetic data, experimental runs) ensure that the compute required per model continues to grow, even as the cost per unit of compute falls.
The demand is real. It is measured. It is growing. And it is the foundation on which everything else in this thesis — inference demand, data center economics, and Oracle’s compound opportunity — is built.