
Will Giant Companies Always Have a Monopoly on Top AI Models?

Part 2 of my dive into how likely LLMs are to either centralize or distribute power
Jay Stanley,
Senior Policy Analyst,
糖心Vlog Speech, Privacy, and Technology Project
August 20, 2025

In my post on large language models (LLMs) last week, I argued that the most important question about LLMs is not the outcome of a race with China or when AI will reach human-level intelligence, but whether this striking new technology will be accessible to and serve the interests of ordinary people, or whether it will end up centrally controlled by a small number of highly capitalized companies or governments.

One of the biggest factors in determining which future we face is how easy it is to access the resources needed to build and train a cutting-edge or "frontier" LLM. Inspired by this question, I did a bit of a dive into the training process for LLMs to try to assess the outlook for free LLMs. To explain what I found, I'll need to look at the basics of how models are trained, as best I understand this fast-moving science from reading what experts in the field are saying.

A turnaround
As I mentioned in the prior post, things started off poorly. In 2022 and 2023, around the time OpenAI's ChatGPT was released to the public, it looked like there was a real likelihood that this powerful new technology would be like nuclear power: centralized, complex, highly capital-intensive, secretive, and subject to strict security controls. The AI "base model" that was the engine driving the ChatGPT application was called GPT-3, which was succeeded by GPT-4 in 2023. Training these models on the vast amounts of data available on the internet and elsewhere was an enormously expensive undertaking that few organizations could afford. The GPT-4 model reportedly cost over $100 million to train, including the acquisition of 25,000 computer graphics cards and the accompanying electricity bill.

And it looked as though this level of resources was just the start. The seemingly miraculous performance of ChatGPT was largely the product of simply taking past research and scaling it up. LLMs that worked very poorly suddenly worked much better simply by exponentially increasing the amount of computing power ("compute") dedicated to their training. Even though the computations were relatively simple, involving predicting the next "token" (roughly, a word or symbol) in a text, when those computations were repeated trillions of times, unexpectedly smart behaviors appeared through a process known as "emergence."

Emergence
Emergence is a phenomenon in complexity science referring to the fact that large numbers of simple rules can, upon repetition, produce complex and surprising behaviors that don't appear to be the predictable result of any characteristics evident in the rules. A classic example is the flocking behavior of birds. Programmers trying to recreate that behavior in a computer bird simulation could tie themselves in knots trying to manually direct V-formations and the beautiful merging and diverging of flocks, but it turns out that if each virtual bird is programmed with just a few simple rules ("don't get too close to your neighbors, but steer toward their average heading and position") a flock of simulated birds will behave in ways strikingly similar to the complex movements of real flocks. That complex behavior emerges out of the simple rules in ways that nobody could ever predict by looking at the rules.
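
To make that concrete, here is a minimal sketch of such a bird simulation, loosely in the spirit of the classic "boids" model. The constants and update rules are toy assumptions of my own, not anyone's production code; the point is simply that each bird follows only three local rules, yet flock-like motion emerges when the loop runs.

```python
# A toy "boids"-style flocking sketch: each simulated bird follows only three
# simple local rules, yet coordinated flock-like motion emerges from repeating
# them. All constants here are invented for illustration.
import random

NEIGHBOR_RADIUS = 10.0   # how far a bird "sees" its neighbors (assumed value)
MIN_DISTANCE = 2.0       # "don't get too close to your neighbors"

class Bird:
    def __init__(self):
        self.x, self.y = random.uniform(0, 100), random.uniform(0, 100)
        self.vx, self.vy = random.uniform(-1, 1), random.uniform(-1, 1)

def step(birds):
    for b in birds:
        near = [o for o in birds if o is not b
                and (o.x - b.x) ** 2 + (o.y - b.y) ** 2 < NEIGHBOR_RADIUS ** 2]
        if not near:
            continue
        for o in near:
            # Rule 1: separation -- steer away from neighbors that are too close.
            if (o.x - b.x) ** 2 + (o.y - b.y) ** 2 < MIN_DISTANCE ** 2:
                b.vx -= 0.05 * (o.x - b.x)
                b.vy -= 0.05 * (o.y - b.y)
        # Rule 2: alignment -- steer toward the neighbors' average heading.
        b.vx += 0.05 * (sum(o.vx for o in near) / len(near) - b.vx)
        b.vy += 0.05 * (sum(o.vy for o in near) / len(near) - b.vy)
        # Rule 3: cohesion -- steer toward the neighbors' average position.
        b.vx += 0.01 * (sum(o.x for o in near) / len(near) - b.x)
        b.vy += 0.01 * (sum(o.y for o in near) / len(near) - b.y)
    for b in birds:
        b.x, b.y = b.x + b.vx, b.y + b.vy

birds = [Bird() for _ in range(50)]
for _ in range(200):   # no rule mentions a "flock," yet flocking emerges
    step(birds)
```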

There are debates over the precise meaning of "emergence" and the role it plays, but overall it seems to me that there's no question that when AI systems are scaled up, they become capable of doing things that are surprising to people and far beyond anything they've been explicitly programmed to do. And that power is not to be underestimated; AI progress may not soar the way boosters predict, but neither should the technology be reductively dismissed as a mere "word prediction machine" or the like.

The original GPT-1 base model, released in 2018, had 117 million parameters: numbers that represent the strength of associations between different tokens, akin to synapses in the model's "brain." The next model, GPT-2, had over 12 times as many, and GPT-3 had 116 times more than that, powering the chatbot that burst into fame as ChatGPT. GPT-4 had an estimated ten times more and performed far better. Simply programming these models to predict the next word in a sequence led to emergent behaviors that seemed surprisingly (and deceptively) intelligent and human in some respects, soaring far beyond what one might expect from a system that, at root, is trained simply to predict the next word in a text.
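
For a rough sense of scale, the multiples above work out to something like the following; GPT-4's parameter count has never been officially disclosed, so the last figure is an outside estimate.

```python
# Rough parameter counts implied by the multiples above; GPT-4's count is an
# unofficial outside estimate, so treat these as order-of-magnitude figures.
gpt1 = 117_000_000            # 117 million parameters (2018)
gpt2 = int(gpt1 * 12.8)       # "over 12 times as many": roughly 1.5 billion
gpt3 = gpt2 * 116             # about 175 billion, the engine behind ChatGPT
gpt4_est = gpt3 * 10          # estimated: well over a trillion

for name, n in [("GPT-1", gpt1), ("GPT-2", gpt2), ("GPT-3", gpt3), ("GPT-4 (est.)", gpt4_est)]:
    print(f"{name}: ~{n:,} parameters")
```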

In 2022 the lesson seemed clear: exponentially scaling up these models was the secret to success. At the rate things were progressing, it seemed plausible to many that with just a few more exponential leaps we might reach human-level intelligence, what is commonly if vaguely referred to as "artificial general intelligence," or AGI. This fueled a "Manhattan Project" conception of LLM research as a geopolitical "race" toward a definitive goal: a sudden, secret breakthrough in reaching human or superhuman artificial intelligence. The winner of this race would obtain not a nuclear weapon but some sort of AI equivalent that would provide new levers for permanent dominance in business, the military, and the world. The implications of such a conception are bad for freedom: it suggests that research efforts should be concentrated and secretive, and that cooperation and openness are foolish.

Chart: The evolution of OpenAI's principal LLMs

But a funny thing happened on the way to AGI: the benefits from scaling up the base models appear to have reached a state of significantly diminishing returns. For several years after GPT-4, for example, the AI world was eagerly awaiting, and OpenAI eagerly promoting, GPT-5, but its release was repeatedly delayed until, in late February 2025, the company finally released GPT-4.5 (suspected by some to be an expectations-lowering rename of GPT-5). It was not dramatically higher-performing. "GPT 4.5 cost about 100x the compute of GPT-4 to train," one expert noted, but "it is only slightly better on normal user metrics. Scaling as a product differentiator died in 2024." Indeed, OpenAI's competitors experienced a similar leveling off of progress in base-model training. The release of a model called GPT-5 in early August only confirmed this trend.

This was true even before the Chinese company DeepSeek made its enormous splash in December 2024 and January 2025, releasing models that achieved more power at far lower training cost than previous models. Some compared it to a company offering $50 smartphones more powerful than the latest $1,000 iPhone. This roiled stock valuations and was viewed largely through the lens of US-China geopolitical rivalries. The real significance of the DeepSeek innovations, however, was that they both clarified and accelerated the declining plausibility of the Manhattan Project model of LLM research.

The stages of training
As research continues, however, it is controversial and unclear to what extent the big science character of LLM training will continue to fade. The training of base models, where the returns to scale have apparently leveled off, at least for now, is only one step in creating LLMs. Meanwhile, other steps in the creation of finished models are being scaled up and absorbing more resources.

Overall, there are four basic stages in the creation of a model today, and they vary in what kinds of resources they require.

a. Data preparation
Access to data for training the base model is the first thing that any actor wanting to train an LLM will need. The data that will be used to train the model must be selected, gathered, and perhaps filtered. Typically this means enormous masses of raw textual data (what is often summarized as "the entire Internet," though it can also include the texts of books, messages, social media posts, and other things that may not be online). Images and video are increasingly being used as well for so-called "multi-modal" models that aim to understand images as well as text.
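
To give a flavor of what "filtering" can involve, here is a purely illustrative sketch of the kind of cleanup pass a training pipeline might run over raw documents. The heuristics and thresholds are invented placeholders, not any lab's actual recipe.

```python
# A toy illustration of data preparation: deduplicate and filter raw documents
# before they are fed into training. The specific heuristics and thresholds are
# invented for illustration only.
import hashlib

def prepare(raw_documents):
    seen_hashes = set()
    kept = []
    for doc in raw_documents:
        text = doc.strip()
        words = text.lower().split()
        # Drop near-empty or extremely short pages.
        if len(words) < 20:
            continue
        # Drop pages that are mostly repetition (a crude boilerplate heuristic).
        if len(set(words)) / len(words) < 0.3:
            continue
        # Exact deduplication by content hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept

page = ("The quick brown fox jumps over the lazy dog near the quiet river bank "
        "while birds sing in the tall green trees today.")
print(len(prepare([page, page, "too short"])))   # -> 1: duplicate and short page dropped
```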

It's not just big companies that can access all this data; there are a number of open datasets that are available for anyone to use. The most prominent, perhaps, is the dataset of the nonprofit organization Common Crawl, which includes regular snapshots of the entire public web gathered since 2008. Other publicly available datasets include the texts of books, Wikipedia, computer code repositories, and research papers. The biggest LLM players like OpenAI, Google, and Anthropic reportedly do their own web crawling instead of just using such databases, and may have more resources to curate and filter the vast oceans of data that are poured into LLM pre-training. It's not clear how much of an edge such work gives them in the quality of an LLM end product.

One of the reasons that the benefits of scaling up base-model training may have leveled off is that the amount of training data used has not scaled up commensurately with the number of parameters in the newest models at the frontier of research, according to some experts. The internet and other prominent data sources have already been tapped by all the most prominent LLMs. But big companies that have exclusive access to other sources of data may as a result have an important leg up. Elon Musk's model Grok, for example, has been trained on data from Twitter/X that nobody else can access. Other big LLM companies like OpenAI don't have their own social media networks, and independent scientists certainly don't. Again, it is unclear how much of an advantage access to that kind of proprietary data will prove to be over time.

b. Pre-training
Pre-training is the first step in actually creating a general LLM "base model" (as opposed to smaller or more specialized models derived from it). It involves teaching the model the basics of language by running the vast datasets through thousands of powerful graphics processing unit (GPU) cards to teach it to predict the next token in a string of text, building in the process a map of associations between different words and concepts. The output of this training is a set of model weights: essentially, a large set of numbers (in the largest models, trillions) representing the strength of associations or "thickness of the lines" between different tokens (corresponding roughly to synapses in the model's "brain").
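
As an illustration of what "predicting the next token" means, here is a toy sketch. Real base models learn billions of neural-network weights on GPUs; this stand-in merely counts which word tends to follow which in a tiny made-up corpus, but the objective is the same in spirit.

```python
# A toy version of the pre-training objective: learn to predict the next token
# from examples of text. Real base models learn billions of weights; this
# sketch just counts which token tends to follow which.
from collections import defaultdict, Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": build a map of associations between a token and what follows it.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(token):
    """Return the most frequently seen next token from training."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))   # -> 'cat' (all continuations tie, so the first seen wins)
print(predict_next("sat"))   # -> 'on'
```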

As we have seen, pre-training is yielding diminishing returns, but it remains the most expensive step, consuming the vast majority of the compute involved in creating a model.

There's another technical development that may reduce the barriers to entry to training models. Most base-model training has taken place in specialized server farms with expensive GPUs packed close together because of the need to transfer enormous amounts of data between GPUs many times and very quickly (an ability known as "interconnect") at each training step. This was seen as so vital that U.S. export controls targeted at China didn't attempt to restrict compute, but only interconnect, on the assumption that this would hobble China's ability to do AI. But advances are being made in distributed training that allow far-flung computers to accomplish the same tasks. One expert, Nathan Labenz, has said that distributed training of LLMs as good as recent frontier models "is the kind of thing now that a well-organized but distributed group could probably potentially patch together the resources to do."
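
The gist of one family of such techniques, sketched below in toy form, is to let each far-flung worker take many cheap local training steps and to synchronize only occasionally, trading interconnect bandwidth for local compute. The toy regression problem and every number here are my own assumptions, not a description of any particular system.

```python
# A toy sketch of low-communication distributed training: each "worker" takes
# many local gradient steps on its own shard of data, and the workers only
# occasionally average their parameters. Toy linear-regression problem.
import random

TRUE_W = 3.0

def make_shard(n):
    return [(x, TRUE_W * x + random.gauss(0, 0.1))
            for x in (random.uniform(-1, 1) for _ in range(n))]

shards = [make_shard(200) for _ in range(4)]   # 4 far-flung workers
weights = [0.0] * 4                            # each worker's local copy of the model

for sync_round in range(10):                   # infrequent synchronization rounds
    for i, shard in enumerate(shards):
        w = weights[i]
        for x, y in shard:                     # many cheap local steps, no network needed
            grad = 2 * (w * x - y) * x
            w -= 0.1 * grad
        weights[i] = w
    avg = sum(weights) / len(weights)          # the only communication step
    weights = [avg] * 4

print(f"learned weight ~= {weights[0]:.2f} (true value {TRUE_W})")
```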

c. Post-training
In the post-training stage, a base model's abilities in a specific area are shaped and refined through a variety of techniques. Supervised fine-tuning (SFT) can be used to refine the model's abilities by providing it with a cultivated set of examples in a specialized area; for example, if you want the model to write about finance, health care, or the law, you might fine-tune it with data from those specialties. Another technique is "reinforcement learning from human feedback" (RLHF), in which humans give a model's outputs thumbs-ups and thumbs-downs to nudge it toward behaving in certain ways and not others. A model can also be trained with synthetic data, feedback from another model, internal model self-critiques, or in some areas from self-play in which a model competes against itself.
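
Here is a toy illustration of the thumbs-up/thumbs-down idea behind RLHF: a stand-in "model" that merely chooses between two reply styles, with simulated annotator feedback nudging its preferences. The setup is invented purely to show the feedback loop, not how any real lab implements RLHF.

```python
# A toy illustration of the RLHF feedback loop: the "model" picks between two
# reply styles, and simulated human thumbs-up/thumbs-down feedback nudges its
# preferences. Everything here is invented for illustration.
import math, random

styles = ["terse", "polite"]
scores = {"terse": 0.0, "polite": 0.0}   # the model's learned preferences

def pick_style():
    # Softmax sampling over the current preference scores.
    weights = [math.exp(scores[s]) for s in styles]
    return random.choices(styles, weights=weights)[0]

def human_feedback(style):
    # Simulated annotator: usually prefers the polite style.
    return +1 if (style == "polite") == (random.random() < 0.9) else -1

for _ in range(500):
    style = pick_style()
    scores[style] += 0.05 * human_feedback(style)   # thumbs up/down nudges the score

print(scores)   # after training, "polite" should score higher than "terse"
```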

Post-training is becoming an increasingly important part of model building today. Several years ago it was mostly focused on style and safety, but its applications have gotten much broader. Post-training techniques are now used to shape a base model into a variety of derivative models. A base model can be trained not only to answer questions in a helpful and engaging manner for a chatbot, but also to actively search and retrieve new information (for "retrieval-augmented generation," or RAG), to carry out tasks (for agents), to include images (for vision-language models), or to specialize in answering objective scientific questions (for reasoning models).
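
Retrieval-augmented generation, for example, boils down to fetching relevant material at answer time and handing it to the model as context. Here is a minimal sketch in which the documents, the word-overlap relevance score, and the prompt format are all toy placeholders.

```python
# A minimal sketch of retrieval-augmented generation (RAG): find the stored
# document most relevant to the user's question and hand it to the model as
# context. The word-overlap scoring and the documents are toy placeholders.
documents = [
    "The library is open from 9am to 5pm on weekdays.",
    "Overdue books incur a fine of 25 cents per day.",
    "The reading room is on the second floor.",
]

def retrieve(query):
    q_words = set(query.lower().split())
    # Score each document by how many query words it shares (toy relevance measure).
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(query):
    context = retrieve(query)
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

# The assembled prompt would then be sent to the language model.
print(build_prompt("What hours is the library open?"))
```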

Experts say that growth in the importance of post-training will likely continue. As Nathan Lambert of the nonprofit Allen AI research institute puts it, "it's very logical that post training will be the next domain for scaling model compute and performance," meaning that, as with pre-training scaling while it lasted, ever-larger resources may be needed to stay on the post-training cutting edge. Lambert points out that post-training is "still far cheaper than pretraining," but that "post-training costs have been growing rapidly" into the tens of millions of dollars.

Still, there are many organizations that can spend tens of millions of dollars, compared to the billions of dollars that many expected model training to eventually cost. In addition, experts like Labenz note that the lower technical difficulty of post-training makes it accessible to many more parties than base-model training. "One of the biggest developments has been the recent revelation that reinforcement learning, on top of at least sufficiently powerful base models, really works and actually can be a pretty simple setup that works remarkably well," he says.

d. Inference
Inference-time compute is the processing that takes place after a user has made a query. When ChatGPT was first released, and for some time afterwards, LLMs did very little of this kind of processing, but the trend has been toward much more. Partly that is a result of reasoning models, which are created in post-training by giving LLMs large numbers of problems where there are objective right and wrong answers, a technique called reinforcement learning from verifiable rewards (RLVR). When training on problems with objective answers, such as in coding, science, and math, models can be trained quickly and in great depth without human participation, essentially teaching themselves. The researchers at DeepSeek reported that one of the emergent (spontaneously emerging) behaviors produced by such training was chain-of-thought reasoning, in which the model explicitly "thinks" step-by-step about the query it has received, explaining its reasoning along the way, and backtracking if necessary, before arriving at the answer. Although this behavior emerges from training on science and math problems, it appears to generalize to other, more subjective domains as well, making the model better at all kinds of queries, including such things as legal reasoning. Wherever it is deployed, chain-of-thought reasoning increases the inference-time compute and thus the cost of running a model.
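
The core of the verifiable-rewards idea is that the grading is automatic, so the training loop needs no human in it. Here is a toy sketch in which the "model" is just a stand-in with a single adjustable error term and the problems are simple addition; real systems update an LLM's weights, but the reward logic is analogous.

```python
# A toy sketch of reinforcement learning from verifiable rewards (RLVR): answers
# to problems with objective solutions (here, addition) are checked automatically,
# so the reward signal needs no human grader. The "model" is a stand-in that
# learns away a single bias term; real systems update an LLM's weights.
import random

bias = 5            # the stand-in model's systematic error, to be trained away

def model_answer(a, b):
    return a + b + bias          # stand-in for sampling an answer from an LLM

def verifiable_reward(a, b, answer):
    return 1 if answer == a + b else 0   # automatic check against the true result

for _ in range(1000):
    a, b = random.randint(0, 50), random.randint(0, 50)
    answer = model_answer(a, b)
    if verifiable_reward(a, b, answer) == 0:
        # Penalized attempt: adjust the model in the direction of the true answer.
        bias += 1 if (a + b) > answer else -1

print(f"learned bias: {bias} (0 means the model now answers correctly)")
```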

Another development that adds to the costs of inference is a trend toward larger "context windows." In many ways LLMs are like the protagonist of the movie Memento or a dementia patient who can access a lifetime's worth of background memory and knowledge, but is unable to form new memories. In answering queries, LLMs always have at hand the world-knowledge that they gained during pre- and post-training, but this knowledge is frozen. In terms of new input (what they can keep in "mind" during a single conversation) they have rather small short-term memories.

The original ChatGPT is said to have had a context window of 8,192 tokens. Increasing context windows is expensive because in preparing an answer, an LLM must compare every token in the window to every other token. That means that the amount of processing compute needed rises roughly with the square of the number of tokens in memory. Nevertheless, some models now have relatively enormous context windows of up to 10 million tokens. This is in part due to clever innovations that are making it possible to reduce the compute needed to work with a window that big. But it's still expensive. And even as inference becomes more compute-intensive, there is strong demand for faster inference times, which is desirable for those who want to use LLMs for coding or for real-time applications like audio chat and live translation.
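
The quadratic growth is easy to see with a little arithmetic; the absolute units below are arbitrary, and only the ratio matters.

```python
# Rough arithmetic on why big context windows are expensive: attention compares
# every token in the window with every other, so work grows with the square of
# the window size. Only the ratio matters here, not the absolute units.
small_window = 8_192        # roughly the original ChatGPT-era context window
large_window = 10_000_000   # a 10-million-token window

small_work = small_window ** 2
large_work = large_window ** 2

print(f"{large_work / small_work:,.0f}x more pairwise comparisons")
# -> about 1,490,116x: a window ~1,220x bigger means roughly 1.5 million times the work
```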

When inference costs are low, creating an LLM is like building a railroad: it involves enormous upfront capital costs, but then, once built, relatively low marginal costs to run. To the extent that inference costs grow, that raises ongoing operating costs. Large context windows and fast inference speeds are competing demands that are fast becoming major vectors of competition among LLM providers, and they reward computing power, scale, and centralization.

A diffusion of training ability
Overall, LLM research is in many ways "spreading outward" compared to its initial Manhattan Project-like character, and becoming more broadly accessible, giving us reason to hope that, especially with active measures by policymakers, LLMs may not become the latest center of growing corporate power over individuals. The picture is complicated and fast-changing, however. It's hard to know what direction progress will come from and how accessible the data, compute, interconnect, and other resources required for such progress will be. But that is the thing to watch, and the thing that policymakers should be actively trying to affect.

In the next installment of this series, I'll look at another crucial factor in determining the future democratic character of LLMs: the state of open source LLM research and models.
