
We can’t not talk about power these days. We’ve been talking about it ever since the Stargate project, with half a trillion dollars in data center investment, was floated early in the year. We’ve been talking about it ever since the now-classic “Stochastic Parrots” paper. And, as time goes on, it only becomes more of an issue.
“Stochastic Parrots” deals with two issues: AI’s power consumption and the fundamental nature of generative AI, which selects sequences of words according to statistical patterns. I always wished those were two papers, because it would be easier to disagree about power and agree about parrots. For me, the power issue is something of a red herring—but increasingly, I see that it’s a red herring that isn’t going away because too many people with too much money want herrings; too many believe that a monopoly on power (or a monopoly on the ability to pay for power) is the route to dominance.
Why, in a better world than we currently live in, would the power issue be a red herring? There are several related reasons:
- I have always assumed that the first-generation language models would be highly inefficient, and that over time we’d develop more efficient algorithms.
- I have also assumed that the economics of language models would be similar to those of chip foundries or pharma factories: The first chip coming out of a foundry costs a few billion dollars; everything afterward is a penny apiece.
- I believe (now more than ever) that, long-term, we will settle on small models (70B parameters or less) that can run locally rather than giant models with trillions of parameters running in the cloud.
And I still believe those points are largely true. But that’s not sufficient. Let’s go through them one by one, starting with efficiency.
Better Algorithms
A few years ago, I saw a fair number of papers about more efficient models. I remember a lot of articles about pruning neural networks (eliminating nodes that contribute little to the result) and other techniques. Papers that address efficiency are still being published—most notably, DeepMind’s recent “Mixture-of-Recursions” paper—but they don’t seem to be as common. That’s just anecdata, and should perhaps be ignored. More to the point, DeepSeek shocked the world with their R1 model, which they claimed cost roughly 1/10 as much to train as the leading frontier models. A lot of commentary insisted that DeepSeek wasn’t being up front in their measurement of power consumption, but since then several other Chinese labs have released highly capable models, with no gigawatt data centers in sight. Even more recently, OpenAI has released gpt-oss in two sizes (120B and 30B), which were reportedly much less expensive to train. It’s not the first time this has happened—I’ve been told that the Soviet Union developed amazingly efficient data compression algorithms because their computers were a decade behind ours. Better algorithms can trump larger power bills, better CPUs, and more GPUs, if we let them.
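To make “pruning” a little more concrete, here’s a minimal sketch using PyTorch’s built-in pruning utilities. It isn’t the method from any particular paper: the layer size is arbitrary, and in practice pruning is applied to a trained model and usually followed by fine-tuning to recover accuracy.

```python
# Minimal sketch of magnitude pruning: zero out the weights that contribute
# least to a layer's output, then check how sparse the layer has become.
# The layer size is arbitrary; real pruning targets a trained model.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(4096, 4096)

# Remove the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # make the pruning permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"Zeroed weights: {sparsity:.0%}")  # ~50%
```

Half the weights are gone, which means roughly half the multiply-accumulates at inference time—the kind of saving those efficiency papers were chasing.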
What’s wrong with this picture? The picture is good, but much of the narrative is US-centric, and that distorts it. First, it’s distorted by our belief that bigger is always better: Look at our cars, our SUVs, our houses. We’re conditioned to believe that a model with a trillion parameters has to be better than a model with a mere 70B, right? That a model that cost a hundred million dollars to train has to be better than one that can be trained economically? That myth is deeply embedded in our psyche. Second, it’s distorted by economics. Bigger is better is a myth that would-be monopolists play on when they talk about the need for ever bigger data centers, preferably funded with tax dollars. It’s a convenient myth, because convincing would-be competitors that they need to spend billions on data centers is an effective way to have no competitors.
One area that hasn’t been sufficiently explored is extremely small models developed for specialized tasks. Drew Breunig writes about the tiny chess model in Stockfish, the world’s leading chess program: It’s small enough to run on an iPhone, and replaced a much larger general-purpose model. And it soundly defeated Claude 3.5 Sonnet and GPT-4o.1 He also writes about the 27-million-parameter Hierarchical Reasoning Model (HRM) that has beaten models like Claude 3.7 on the ARC benchmark. Pete Warden’s Moonshine does real-time speech-to-text transcription in the browser—and is as good as any high-end model I’ve seen. None of these are general-purpose models. They won’t vibe code; they won’t write your blog posts. But they are extremely effective at what they do. And if AI is going to fulfill its destiny of “disappearing into the walls,” of becoming part of our everyday infrastructure, we will need very accurate, very specialized models. We will have to free ourselves of the myth that bigger is better.2
The Cost of Inference
The purpose of a model isn’t to be trained; it’s to do inference. This is a gross simplification, but part of training is doing inference trillions of times and adjusting the model’s billions of parameters to minimize error. A single request takes an extremely small fraction of the effort required to train a model. That fact leads directly to the economics of chip foundries: The ability to process the first prompt costs millions of dollars, but once a model is in production, processing a prompt costs a fraction of a cent. Google has claimed that processing a typical text prompt to Gemini takes 0.24 watt-hours, significantly less than it takes to heat water for a cup of coffee. They also claim that increases in software efficiency have led to a 33x reduction in energy consumption over the past year.
That’s obviously not the entire story: Millions of people prompting ChatGPT adds up, as does usage of newer “reasoning” models that have an extended internal dialog before arriving at a result. Likewise, driving to work rather than biking raises the global temperature a nanofraction of a degree—but when you multiply the nanofraction by billions of commuters, it’s a different story. It’s fair to say that an individual who uses ChatGPT or Gemini isn’t a problem, but it’s also important to realize that millions of users pounding on an AI service can grow into a problem quite quickly. Unfortunately, it’s also true that increases in efficiency often don’t lead to reductions in energy use but to solving more complex problems within the same energy budget. We may be seeing that with reasoning models, image and video generation models, and other applications that are now becoming financially feasible. Does this problem require gigawatt data centers? Not necessarily, but it’s the kind of problem that can be used to justify building them.
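To put rough numbers on that intuition, here’s a back-of-the-envelope sketch that starts from Google’s 0.24 watt-hour figure. The daily prompt volume is a made-up round number, not a published statistic; the point is only how quickly “negligible” adds up.

```python
# Back-of-the-envelope scaling of the 0.24 Wh-per-prompt claim.
# The prompt volume is a hypothetical round number, not real traffic data.
WH_PER_PROMPT = 0.24              # Google's claimed energy per Gemini text prompt
PROMPTS_PER_DAY = 1_000_000_000   # assumption: one billion prompts per day

daily_mwh = WH_PER_PROMPT * PROMPTS_PER_DAY / 1e6   # Wh -> MWh
avg_power_mw = daily_mwh / 24                       # average draw over 24 hours

print(f"Daily energy: {daily_mwh:,.0f} MWh")    # 240 MWh/day
print(f"Average draw: {avg_power_mw:,.1f} MW")  # 10 MW continuous
```

Ten megawatts of continuous draw is nowhere near a gigawatt, but it’s no longer a cup of coffee either—and it scales linearly with usage.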
There is a solution, but it requires rethinking the problem. Telling people to use public transportation or bicycles for their commute is ineffective (in the US), just as telling people not to use AI will be. The problem needs to be rethought: redesigning work to eliminate the commute (O’Reilly is 100% work from home), rethinking the way we use AI so that it doesn’t require cloud-hosted trillion-parameter models. That brings us to using AI locally.
Staying Local
Almost everything we do with GPT-*, Claude-*, Gemini-*, and other frontier models could be done equally effectively on much smaller models running locally: in a small corporate machine room or even on a laptop. Running AI locally also shields you from problems with availability, bandwidth, limits on usage, and leaking private data. This is a story that would-be monopolists don’t want us to hear. Again, this is anecdata, but I’ve been very impressed by the results I get from running models in the 30 billion parameter range on my laptop. I do vibe coding and get mostly correct code that the model can (usually) fix for me; I ask for summaries of blogs and papers and get excellent results. Anthropic, Google, and OpenAI are competing for tenths of a percentage point on highly gamed benchmarks, but I doubt that those benchmark scores have much practical meaning. I would love to see a study on the difference between Qwen3-30B and GPT-5.
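For anyone who hasn’t tried it, local use looks almost identical to using a cloud API. The sketch below assumes a local runtime such as Ollama or a llama.cpp server is already running and exposing an OpenAI-compatible endpoint on Ollama’s default port; the model tag is an assumption—use whatever your runtime calls its roughly 30B-parameter model.

```python
# Sketch of local inference against an OpenAI-compatible local server.
# Assumptions: Ollama (or similar) is serving on localhost:11434, and a
# ~30B model has been pulled under the tag used below.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # assumption: Ollama's default port
    api_key="not-needed-locally",          # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="qwen3:30b",  # assumption: your runtime's tag for a ~30B model
    messages=[
        {"role": "user", "content": "Summarize the main argument of this post in three bullets: ..."}
    ],
)
print(response.choices[0].message.content)
```

Nothing leaves the laptop: no usage limits, no bandwidth dependency, no private data crossing the wire.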
What does that mean for energy costs? It’s unclear. Gigawatt data centers built for inference would be unnecessary if people do inference locally, but what are the consequences of a billion users doing inference on high-end laptops? If I give my local AIs a difficult problem, my laptop heats up and runs its fans. It’s using more electricity. And laptops aren’t as efficient as data centers that have been designed to minimize electricity use. It’s all well and good to scoff at gigawatts, but when you’re using that much power, minimizing power consumption saves a lot of money. Economies of scale are real. Personally, I’d bet on the laptops: Computing with 30 billion parameters is undoubtedly going to be less energy-intensive than computing with 3 trillion parameters. But I won’t hold my breath waiting for someone to do this research.
There’s another side to this question, and that involves models that “reason.” So-called “reasoning models” have an internal conversation (not always visible to the user) in which the model “plans” the steps it will take to answer the prompt. A recent paper claims that smaller open source models tend to generate many more reasoning tokens than large models (3 to 10 times as many, depending on the models you’re comparing), and that the extensive reasoning process eats away at the economics of the smaller models. Reasoning tokens must be processed, the same as any user-generated tokens; this processing incurs charges (which the paper discusses), and charges presumably relate directly to power.
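Here’s a rough sketch of that trade-off. The per-token prices and token counts are illustrative placeholders, not figures from the paper; the point is how a reasoning-token multiplier erodes a small model’s per-token price advantage.

```python
# Illustrative comparison of output cost when a small model "reasons" more.
# All prices and token counts are placeholder assumptions, not measured data.
def completion_cost(output_tokens: int, price_per_million: float) -> float:
    """Dollar cost of generating output_tokens at a given $/1M-token price."""
    return output_tokens / 1e6 * price_per_million

large_tokens = 3_000           # assumed reasoning trace from a large model
small_tokens = large_tokens * 5  # assumed 5x more reasoning tokens from a small model

small_cost = completion_cost(small_tokens, price_per_million=0.60)   # assumed price
large_cost = completion_cost(large_tokens, price_per_million=10.00)  # assumed price

print(f"Small model: {small_tokens:,} tokens -> ${small_cost:.4f}")
print(f"Large model: {large_tokens:,} tokens -> ${large_cost:.4f}")
# The small model can still come out ahead, but by far less than the
# per-token prices alone would suggest.
```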
While it’s surprising that small models generate more reasoning tokens, it’s no surprise that reasoning is expensive, and we need to take that into account. Reasoning is a tool to be used; it tends to be particularly useful when a model is asked to solve a problem in mathematics. It’s much less useful when the task involves looking up facts, summarization, writing, or making recommendations. It can help in areas like software design but is likely to be a liability for generative coding. In these cases, the reasoning process can actually become misleading—in addition to burning tokens. Deciding how to use models effectively, whether you’re running them locally or in the cloud, is a task that falls to us.
Going to the giant reasoning models for the “best possible answer” is always a temptation, especially when you know you don’t need the best possible answer. It takes some discipline to commit to the smaller models—even though it’s difficult to argue that using the frontier models is less work. You still have to analyze their output and check their results. And I confess: As committed as I am to the smaller models, I tend to stick with models in the 30B range, and avoid the 1B–5B models (including the excellent Gemma 3N). Those models, I’m sure, would give good results, use even less power, and run even faster. But I’m still in the process of peeling myself away from my knee-jerk assumptions.
Bigger isn’t necessarily better; more power isn’t necessarily the route to AI dominance. We don’t yet know how this will play out, but I’d place my bets on smaller models running locally and trained with efficiency in mind. There will no doubt be some applications that require large frontier models—perhaps generating synthetic data for training the smaller models—but we really need to understand where frontier models are needed, and where they aren’t. My bet is that they’re rarely needed. And if we free ourselves from the desire to use the latest, largest frontier model just because it’s there—whether or not it serves our purpose any better than a 30B model—we won’t need most of those giant data centers. Don’t be seduced by the AI-industrial complex.
Footnotes
- I’m not aware of games between Stockfish and the more recent Claude 4, Claude 4.1, and GPT-5 models. There’s every reason to believe the results would be similar.
- Kevlin Henney makes a related point in “Scaling False Peaks.”
#Megawatts #Gigawatts #OReilly