Published 2026-06-24

Below the Ice — The AI Race Just Moved Into Silicon

OpenAI and Broadcom unveiled 'Jalapeño,' a chip built for one job: running large language models. Tonight we go under the headline. We start from a question most builders never have to ask — what is inference, really, and how is it different from training? Then we build up from there: why general-purpose GPUs leave performance and electricity on the table, what 'LLM-optimized silicon' actually changes, why the recurring cost of AI lives in inference (power can be roughly 40% of a data center's operating bill), what's overhyped about every custom-chip announcement, and the three things worth watching as the model builders all quietly start forging their own silicon.

This is the print twin of tonight's Below the Ice — our evening deep-dive, one topic told properly. Prefer it in your ears while you wind down? Listen to today's episode.

The morning wire carried a headline that's easy to skim past: OpenAI and Broadcom unveiled a custom chip called "Jalapeño", built specifically to run large language models. A chip is not a model launch. There's no chatbot to play with, no demo to share. But sit with it for a minute and it tells you where the whole industry is heading. For years the AI race was about who had the best model. Quietly, it has become a race about who can run that model cheapest, fastest, and with the least electricity. So tonight we go below the headline. We start from first principles — what does it even mean to "run" a model — and build up to why the next AI race is being fought in silicon and power bills.

What it is

Jalapeño is an inference chip. That one word, inference, is the whole story, so let's slow down on it.

There are two completely different jobs in the life of an AI model. The first is training: teaching the model, once, by showing it enormous amounts of data and adjusting billions of internal numbers until it gets good. Training is a months-long, brute-force marathon that happens in a handful of giant data centers. The second is inference: actually using the trained model to answer a request — every time you ask ChatGPT a question, every time an agent reads an email, every token of every reply. Training happens once. Inference happens billions of times a day, forever.

Most of the famous AI hardware — NVIDIA's GPUs — was built and bought to win the training marathon. Jalapeño is built for the other job. A chip purpose-designed to do nothing but run already-trained models, as efficiently as physically possible. OpenAI isn't the first to do this; it's the latest, and that's the part worth understanding.

How it actually works

Here's an analogy, and given the chip's name, let's make it a kitchen.

A modern GPU is a fully-equipped professional kitchen. It can cook anything — sear a steak, bake bread, make a sauce. That flexibility is exactly why it won training, where you're constantly trying new recipes. But flexibility has a cost. The kitchen carries equipment for dishes you're not making right now, and a chef has to decide what to do at every step. When all you want, billions of times an hour, is to make the same one dish — run a transformer's forward pass — a general kitchen spends energy and motion on generality you no longer need.

A custom inference chip is the opposite: a machine built to make exactly one dish, perfectly, over and over. In silicon terms it's an ASIC, an application-specific integrated circuit, where the operations a language model actually performs at inference are hardwired into the metal instead of being software running on general hardware. The two things it optimizes hardest are memory bandwidth and energy per token. The first of those surprises people: at inference, the bottleneck usually isn't raw math, it's moving the model's weights from memory into the compute units fast enough. So custom inference silicon is largely an exercise in feeding the math engine without stalling, and doing it with the fewest joules possible, often by leaning on very low-precision number formats (8-bit, even 4-bit) that a model tolerates at inference but that are far harder to train in.
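To make the memory-bandwidth point concrete, here is a rough sketch of the ceiling it puts on generation speed. It assumes the simplest possible picture: every weight is read from memory once per generated token, with no batching or caching tricks. The model size and bandwidth figures are made-up illustrative values, not Jalapeño specifications, and the function name is ours.

```python
# Back-of-the-envelope ceiling on decode speed when moving weights is the bottleneck.
# Assumption: each generated token requires reading every weight from memory once.
# All numbers below are illustrative, not specifications of any real chip.

def max_tokens_per_second(params_billion: float, bits_per_weight: int,
                          bandwidth_gb_per_s: float) -> float:
    """Upper bound on single-stream tokens/s for a purely bandwidth-bound decoder."""
    bytes_read_per_token = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_per_s * 1e9 / bytes_read_per_token


if __name__ == "__main__":
    PARAMS_BILLION = 70         # hypothetical model size
    BANDWIDTH_GB_PER_S = 2000   # hypothetical memory bandwidth
    for bits in (16, 8, 4):
        rate = max_tokens_per_second(PARAMS_BILLION, bits, BANDWIDTH_GB_PER_S)
        print(f"{bits:>2}-bit weights: at most {rate:.1f} tokens/s per stream")
```

Under these assumptions, halving the bits per weight doubles the ceiling without adding any compute, which is one reason low-precision formats matter so much at inference.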

The trade is simple and brutal: you give up the ability to cook anything, and in return you make your one dish faster, cheaper, and cooler. NVIDIA itself frames the modern data center as a full-stack efficiency problem — every watt either does useful work or it's overhead. A specialized chip is one way to shift that ratio.

Why it matters now

For most of the last few years, the scary, expensive number in AI was training. That's flipping.

The reason is volume. A model is trained once but served endlessly, and at the scale OpenAI and its peers now operate, the recurring cost of inference dwarfs the one-time cost of training. And the dominant line item in that recurring cost isn't chips, it's electricity. In NVIDIA's own accounting, power can be around 40% of the operating expense of running an "AI factory." When a fraction of a joule per token is multiplied across billions of requests a day, energy efficiency stops being an engineering nicety and becomes the business model. Shave the joules-per-token and you've shaved your largest bill.
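The arithmetic behind that claim is short enough to sketch. The function below converts an energy-per-token figure into a daily electricity bill. The token volume, joules per token, and electricity price are hypothetical placeholders chosen only to show the shape of the calculation; none of them come from OpenAI or NVIDIA.

```python
# Rough electricity bill for serving tokens, under stated assumptions.
# 1 kWh = 3.6 million joules. All inputs are hypothetical placeholders.

JOULES_PER_KWH = 3.6e6


def daily_energy_cost(tokens_per_day: float, joules_per_token: float,
                      price_per_kwh: float) -> float:
    """Electricity cost in dollars for one day of serving."""
    kwh = tokens_per_day * joules_per_token / JOULES_PER_KWH
    return kwh * price_per_kwh


if __name__ == "__main__":
    TOKENS_PER_DAY = 1e12   # hypothetical fleet-wide volume
    PRICE_PER_KWH = 0.08    # hypothetical electricity price in dollars
    baseline = daily_energy_cost(TOKENS_PER_DAY, 0.5, PRICE_PER_KWH)
    improved = daily_energy_cost(TOKENS_PER_DAY, 0.4, PRICE_PER_KWH)
    print(f"baseline: ${baseline:,.0f}/day")
    print(f"improved: ${improved:,.0f}/day")
    print(f"saved over a year: ${(baseline - improved) * 365:,.0f}")
```

The point is not the specific dollar figures but the multiplication: a saving too small to notice on one token becomes a real line item at fleet volume.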

That economic gravity is why this isn't a one-off. Google has shipped its own Tensor Processing Units for years. Amazon designed Inferentia specifically for inference and Trainium for training. Microsoft announced its own Maia accelerator, and Meta built MTIA. With Jalapeño, OpenAI joins a club every hyperscaler has already entered: the model builders are vertically integrating into silicon — designing their own chips to control cost, lock in supply, tune the hardware to their exact workloads, and lean a little less on a single supplier. The same partner pattern keeps showing up too, with NVIDIA and AWS stitching their stacks together at production scale. For a builder, the takeaway is less "which chip wins" and more this: the cost of running AI is now an infrastructure question, and the people who own the infrastructure are racing to own the metal underneath it.

What is overhyped

Now the honest part, because chip announcements come wrapped in a lot of confident narration.

The biggest overreach is "OpenAI just dethroned NVIDIA." It didn't. NVIDIA still owns training, and its real moat was never only the silicon — it's CUDA, a decade-deep software ecosystem that everything in AI is already written against. A custom chip with no mature toolchain is a fast kitchen no chef knows how to use. Specialized inference ASICs pay off only at enormous, stable scale for a workload you're sure of, and they cost years and billions to design. For almost everyone, building a chip is the wrong move; renting the right one is the smart one.
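One way to see why scale decides this is a simple break-even sketch: how many tokens must be served before a one-time chip program pays for itself. The design cost and per-token costs below are invented for illustration, and the model ignores yield, depreciation, and software effort, so treat it as a thinking aid rather than a business case.

```python
# Break-even volume for a custom chip program, under simplified assumptions.
# Ignores yield, depreciation, and software cost; all figures are hypothetical.
import math


def breakeven_tokens(design_cost_usd: float,
                     rented_cost_per_million_tokens: float,
                     custom_cost_per_million_tokens: float) -> float:
    """Tokens that must be served before per-token savings repay the design cost."""
    saving_per_million = rented_cost_per_million_tokens - custom_cost_per_million_tokens
    if saving_per_million <= 0:
        return math.inf  # the custom chip never pays off
    return design_cost_usd / saving_per_million * 1e6


if __name__ == "__main__":
    tokens = breakeven_tokens(
        design_cost_usd=1e9,                 # hypothetical program cost
        rented_cost_per_million_tokens=2.0,  # hypothetical cost on rented hardware
        custom_cost_per_million_tokens=1.0,  # hypothetical cost on the custom chip
    )
    print(f"break-even after {tokens:.2e} tokens")
```

With numbers like these the break-even sits in the quadrillions of tokens, which is why the math only works for the handful of operators already serving at that volume.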

The second bit of hype is treating an announcement as a deployment. "Unveiled" is not "running in production at scale." Between a press post and a chip that actually serves your requests sit tape-out, manufacturing yield, supply, and a software stack that has to mature for real workloads. Plenty of impressive accelerators have been announced; far fewer have moved the needle on a real fleet.

And there's a quieter risk baked into the approach itself: ASICs are inflexible by design. Hardwiring today's transformer math is a bet that the math won't move much. If model architectures shift — new attention schemes, diffusion-style text generation, something we haven't named yet — silicon optimized for yesterday's shape can be stranded. Models iterate in weeks; chips iterate in years. That mismatch is the permanent tension under every custom-silicon story.

What to watch

Three concrete things, the way we close every dive.

The software stack, not the chip. The hard part isn't etching a fast inference engine; it's the compiler and runtime that let real models run on it without heroics. CUDA is the bar. Watch whether Jalapeño — and every rival accelerator — ships a toolchain people can actually build on, because that, not peak FLOPS, decides whether silicon gets used.
Energy per token as the real scoreboard. Power is the constraint now, so the number that matters is perf-per-watt, measured independently of the vendor's launch slides. Watch for third-party benchmarks on joules per token and tokens per dollar — the unglamorous metrics that quietly decide who can serve a frontier model at a profit.
How far vertical integration goes. Jalapeño is one data point in a clear trend. Watch whether the next year brings more model builders forging their own inference silicon, and whether it genuinely loosens NVIDIA's grip or just adds a second tier of in-house chips alongside the GPUs everyone still buys. The race moved into silicon; the question is how many players can afford to run it.
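For the second of those items, the two scoreboard metrics can be computed from four numbers any independent benchmark would need to report. The sketch below is an assumption about what such a report might contain; the class and field names are ours, and the sample values are placeholders rather than measurements of any real accelerator.

```python
# Joules per token and tokens per dollar from one benchmark run.
# Field names and sample values are hypothetical, for illustration only.
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    tokens_generated: int
    avg_power_watts: float     # average wall power during the run
    duration_seconds: float
    cost_per_hour_usd: float   # all-in hourly cost of the hardware

    def joules_per_token(self) -> float:
        # Energy is power multiplied by time; divide by the tokens produced.
        return self.avg_power_watts * self.duration_seconds / self.tokens_generated

    def tokens_per_dollar(self) -> float:
        run_cost = self.cost_per_hour_usd * self.duration_seconds / 3600
        return self.tokens_generated / run_cost


if __name__ == "__main__":
    run = BenchmarkRun(tokens_generated=1_000_000, avg_power_watts=500.0,
                       duration_seconds=3600.0, cost_per_hour_usd=2.0)
    print(f"{run.joules_per_token():.2f} J/token")
    print(f"{run.tokens_per_dollar():,.0f} tokens per dollar")
```

Comparing two chips on these two numbers, measured on the same model and the same prompts, says more than any peak FLOPS figure on a launch slide.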

The reassuring version of tonight's story is the headline: another big AI chip, the future is fast. The truer version is quieter and more interesting. The frontier of AI is no longer only about making models smarter. It's about the deeply physical, deeply boring problem of running them — the memory, the metal, and the electricity. The companies that win the next phase won't just have the best answers. They'll have the cheapest way to give one.

That's tonight's Below the Ice. The full episode — same topic, slower and out loud — is up now: listen to today's episode. More deep-dives at penguinalley.com.

Sources: OpenAI and Broadcom unveil the Jalapeño inference chip — OpenAI · Maximize AI factory energy efficiency — NVIDIA Developer Blog · NVIDIA and AWS bring AI to production at scale — NVIDIA · Cloud TPU — Google Cloud · AWS Inferentia · Azure Maia — Microsoft
