Published 2026-06-27

Below the Ice — Two Models Are Faster Than One: Speculative Decoding

Your language model writes one word at a time, each token waiting on the last — which is exactly why it feels slow. Tonight we go under the hood of speculative decoding, the draft-and-verify trick that DeepSeek's new DSpark paper uses to speed things up. A small, cheap model guesses several tokens ahead; the big, expensive model checks them all in a single pass, keeps the guesses it would have made anyway, and corrects the rest — so the output is identical to running the big model alone, just quicker. We build it from first principles, explain why verifying many tokens can cost about the same as generating one, why this matters now that inference cost and latency rule the economics of running models, what's overhyped about the headline speedup numbers, and the three things to watch next.

This is the print twin of tonight's Below the Ice — our evening deep-dive, one topic told properly. Prefer it in your ears while you wind down? Listen to today's episode.

Here is a small thing you've probably felt without naming it: when a chatbot answers you, the words arrive — one after another, left to right, like someone typing in real time. That trickle isn't a UI flourish. It's the model genuinely producing text the only way it knows how, one token at a time, each one waiting on the last. Tonight's topic is the clever trick that breaks that single-file line without changing a single word of the output. It's called speculative decoding, and it's having a moment because the most-discussed paper on Hacker News today is DeepSeek's DSpark — a fresh take on the technique that drew nearly 700 points and almost 300 comments of builders picking it apart. We go below the headline: what the trick is, why it works at all, and whether "free speedup" is ever actually free.

What it is

Speculative decoding is a way to make a large language model generate text faster by pairing it with a second, much smaller model — and crucially, without changing the output. The big model is the one you actually want answers from. The small one is a fast, cheap stand-in.

The idea in one breath: let the small model guess the next several tokens, then let the big model check all of those guesses at once. Wherever the big model agrees with a guess, you keep it for free. At the first place it disagrees, you throw away the rest and take the big model's correction. Repeat. Because the big model has the final say on every token, the text you get out is provably, not approximately, what the big model would have produced on its own: token for token under greedy decoding, and drawn from exactly the same distribution when sampling. You just got there in fewer slow steps.

DeepSeek's DSpark, part of their DeepSpec project, is the newest entry in a line of work that goes back to a 2022 Google paper. It's worth being precise about what's old and what's new here: the core of speculative decoding is several years old and well understood. What papers like DSpark chase is the part that's still very much open — making the guessing better, the verification cheaper, and the whole thing easier to run in a real serving stack.

How it actually works

To feel why this helps, you have to understand why generation is slow in the first place — and the reason is counterintuitive. It is not that the math of producing one token is enormous. It's that producing each token requires reading the model's entire set of weights from memory, and on modern hardware that memory traffic, not the arithmetic, is the bottleneck. A single forward pass through a big model moves tens of gigabytes of weights just to emit one token. The GPU's math units spend most of that time idle, waiting on memory. Generation is, in the jargon, memory-bandwidth bound.

Here's the lever that fact hands you. If you've already paid to load the weights for one forward pass, checking ten candidate tokens in that same pass costs barely more than checking one — the expensive part (hauling the weights in) is already done; running ten positions through the math units in parallel is close to free, because those units were sitting idle anyway. Verifying many tokens is cheap. Generating them one by one is expensive. Speculative decoding is the trick that exploits exactly this asymmetry.
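
To put rough numbers on that asymmetry, here is a back-of-the-envelope sketch. It assumes a decode step costs about as long as streaming the weights from memory once, and it ignores the KV cache, the arithmetic itself, and the draft model's own cost. The 140 GB and 3,000 GB/s figures are hypothetical, picked for illustration rather than taken from the DSpark paper, and the function names are ours.

```python
def decode_step_ms(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Rough lower bound on one forward pass when weight loading dominates."""
    return weight_gb / bandwidth_gb_s * 1000.0


def tokens_per_second(weight_gb: float, bandwidth_gb_s: float,
                      tokens_per_pass: float = 1.0) -> float:
    """Throughput if each pass over the weights yields `tokens_per_pass` tokens."""
    seconds_per_pass = weight_gb / bandwidth_gb_s
    return tokens_per_pass / seconds_per_pass


# Hypothetical setup: 140 GB of weights, 3,000 GB/s of memory bandwidth.
WEIGHT_GB = 140.0
BANDWIDTH_GB_S = 3000.0

step_ms = decode_step_ms(WEIGHT_GB, BANDWIDTH_GB_S)
plain = tokens_per_second(WEIGHT_GB, BANDWIDTH_GB_S)
# Same weight traffic, but assume 3 tokens survive verification per pass.
speculative = tokens_per_second(WEIGHT_GB, BANDWIDTH_GB_S, tokens_per_pass=3.0)

print(f"{step_ms:.1f} ms per pass, {plain:.1f} tok/s plain, "
      f"{speculative:.1f} tok/s at 3 tokens per pass")
```

Under these made-up inputs a pass takes about 47 ms, so one token per pass is roughly 21 tokens per second, and three tokens per pass is roughly 64. The only thing that changed is how many tokens each trip through the weights pays for.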

So picture the workflow as a junior writer and a senior editor. The junior — fast, a little sloppy — drafts the next sentence ahead: "the cat sat on the…" and guesses "mat." The senior editor reads the whole draft sentence in a single glance and, for each word, asks one question: is this the word I would have written here? As long as the answer is yes, the editor nods the words through. At the first word where the answer is no — say the editor would have written "windowsill," not "mat" — they cross out from there, write their own word, and hand it back. The junior drafts ahead from the new point. The finished page reads exactly as if the senior had written every word, because at every position the senior had veto power. But on the stretches where the cheap draft was right — and for predictable text it's right a lot — you skipped the senior's slow, one-word-at-a-time labor entirely.
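
The junior-and-senior loop can be sketched in a few lines. This is a toy, not DSpark's method or any framework's API: `target_next` and `draft_next` are made-up stand-ins for the big and small models, decoding is greedy (always take the single most likely token), and the "single glance" is simulated with a Python loop where a real system would use one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def target_next(ctx: Sequence[int]) -> int:
    """Toy stand-in for the big model's greedy choice (hypothetical)."""
    return (3 * ctx[-1] + len(ctx)) % 11


def draft_next(ctx: Sequence[int]) -> int:
    """Toy stand-in for the small model: wrong whenever len(ctx) % 4 == 0."""
    guess = target_next(ctx)
    return (guess + 1) % 11 if len(ctx) % 4 == 0 else guess


def plain_greedy(prompt: List[int], n_new: int, target: NextToken) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_greedy(prompt: List[int], n_new: int, target: NextToken,
                       draft: NextToken, k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. Draft: the small model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. Verify: the target's own choice at each of the k + 1 positions.
        #    Counted as one pass; a real system computes these in parallel.
        target_passes += 1
        checks = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest matching prefix, then take the target's token
        #    at the first miss (or a bonus token if every guess matched).
        n_ok = 0
        while n_ok < k and proposal[n_ok] == checks[n_ok]:
            n_ok += 1
        out.extend(proposal[:n_ok])
        out.append(checks[n_ok])
    return out[:len(prompt) + n_new], target_passes


tokens, passes = speculative_greedy([1], 20, target_next, draft_next, k=4)
print(tokens == plain_greedy([1], 20, target_next), passes)
```

With this particular toy drafter, the speculative run reproduces the plain greedy output exactly while calling the target for 5 verification passes instead of 20 single-token steps. A drafter that guessed worse would need more passes, but the tokens would still match.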

Figure: a conceptual illustration of speculative decoding's three steps. A small fast model drafts a run of tokens ahead, the large model verifies them all in a single parallel pass, and the longest correct prefix is accepted while the first miss is corrected.

That's the whole mechanism, and it has three moving parts: drafting (the small model proposes a short run of tokens), parallel verification (the big model scores all of them in one forward pass), and acceptance (a sampling rule keeps the longest correct prefix and corrects the first miss). The math that makes the last step exactly preserve the big model's output distribution — not merely approximate it — was worked out in the original papers from Google and DeepMind, and it's the reason this is called lossless. What a new paper like DSpark contributes lives in the seams: better drafters, smarter verification, and squeezing more accepted tokens out of each expensive pass. The HN thread is largely builders comparing those seams against the methods already shipping in serving frameworks.
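
For the sampling case, the acceptance rule from those original papers is short enough to write down. A minimal sketch for a single position, assuming `p` is the big model's next-token distribution and `q` is the drafter's (both plain lists of probabilities; the function names are ours): keep the drafted token `x` with probability min(1, p[x]/q[x]), otherwise resample from the normalized leftover max(0, p - q). The helper `output_distribution` computes analytically what that procedure emits, so the "lossless" claim can be checked rather than taken on faith.

```python
import random
from typing import List, Tuple


def residual(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(0, p - q): where the target wants more mass than the draft gave."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    # If p == q there is nothing left over and rejection never happens.
    return [r / total for r in raw] if total > 0 else list(p)


def accept_or_correct(x: int, p: List[float], q: List[float],
                      rng: random.Random) -> Tuple[int, bool]:
    """x was sampled from q. Returns (emitted token, whether the draft was kept)."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    corrected = rng.choices(range(len(p)), weights=residual(p, q))[0]
    return corrected, False


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token under the rule above."""
    # P(draft x and accept) = q[x] * min(1, p[x] / q[x]) = min(p[x], q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    res = residual(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, res)]


# Made-up three-token vocabulary where the drafter disagrees with the target.
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
print(output_distribution(p, q))
print(accept_or_correct(1, p, q, random.Random(0)))
```

In this example the emitted distribution comes back as `p` up to floating-point rounding, even though the drafter's `q` is quite different. The mismatch shows up only as a lower chance of accepting: the sum of min(p, q), here 0.6.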

Why it matters now

A few years ago, inference speed was a nice-to-have. Today it's close to the whole game, for two reasons that compounded at once.

First, latency is the product. When you're chatting with an assistant or watching an agent work, the wait between tokens is the experience. Shaving it doesn't just feel nicer — it changes what's usable. An agent that has to emit thousands of tokens of reasoning before it acts is painful at one speed and pleasant at three times that speed, for identical output.

Second, and bigger: the token bills are exploding. The newest reasoning models don't answer in a sentence; they think in long internal monologues, emitting huge volumes of tokens per task. OpenAI recently reported that since late 2025 its median internal Codex output grew 56x in research and 27x in engineering. Every one of those tokens is, by default, one slow memory-bound step. When the volume of generated text climbs like that, the speed and cost of generation stop being an implementation detail and become the dominant line item in running a model at all. Speculative decoding is attractive precisely because it attacks that line item without the usual devil's bargain: no quantization blur, no smaller model, no quality traded for speed. On paper, the output is exactly what the big model would have produced. That "free lunch" framing is exactly why it spreads, and exactly why we should poke at it.

What is overhyped

So let's poke. The honest caveats are all hiding inside one word in every headline: up to.

The speedups you read — "up to 2-3x," sometimes louder — are best-case numbers on friendly workloads. The actual gain rides almost entirely on one quantity: the acceptance rate, how often the big model agrees with the draft. On predictable text — boilerplate code, formatting, the obvious continuation of a sentence — acceptance is high and the speedup is real. On genuinely hard, creative, or surprising output, the cheap draft guesses wrong more often, more guesses get thrown away, and the advantage shrinks toward nothing. You are also, remember, paying to run the draft model on every step, including the tokens that get rejected. That overhead is small but it isn't zero, and on an already-saturated GPU serving many users in a big batch — where the math units aren't idle anymore — the free lunch can quietly cost something.
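
How hard the payoff leans on the acceptance rate is easy to see with the expected-speedup formula from the Leviathan et al. paper listed in the sources. This sketch makes the same simplifying assumptions: each drafted token is accepted independently with probability `alpha`, the drafter proposes `k` tokens per round, one draft step costs a fraction `draft_cost` of a big-model pass, and verifying k + 1 positions costs the same as one ordinary pass. That last assumption is exactly what breaks on a saturated GPU. The input values below are illustrative, not measurements.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)  # every guess accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding; below 1.0 means speculation is a net loss."""
    # One round costs k draft steps plus one target pass, in target-pass units.
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)


for alpha, cost in [(0.8, 0.05), (0.3, 0.05), (0.3, 0.20)]:
    print(f"alpha={alpha}, draft_cost={cost}: "
          f"{expected_speedup(alpha, k=4, draft_cost=cost):.2f}x")
```

With four drafted tokens per round, an acceptance rate of 0.8 and a cheap drafter works out to about 2.8x. Drop acceptance to 0.3 and it is about 1.19x, and with a pricier drafter (0.20 of a target pass) the same 0.3 lands at about 0.79x, which is slower than not speculating at all.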

Then there's the gap between "lossless in theory" and "identical in practice." The math guarantees the output distribution is preserved, but a production serving stack is a thornier place than a paper: floating-point differences, how requests are batched together, and the interaction with other optimizations can all introduce wrinkles the clean proof doesn't cover. "Lossless" is a real and important property — it's what separates this from cutting quality to go fast — but treat a vendor's specific multiplier the way you'd treat any benchmark: ask what workload, what hardware, and what acceptance rate produced it. The technique is genuinely good. The number on the slide is a best case.

What to watch

Three concrete things, the way we close every dive.

Self-drafting methods that drop the second model. The tidiest version of this idea makes the big model draft for itself — predicting several of its own next tokens cheaply — so you don't have to find, train, and host a separate small model that's well-matched to the big one. Approaches in this family (EAGLE and Medusa are the names to know) are where a lot of the energy is. Watch whether DSpark's contribution gets absorbed into the open serving frameworks builders actually run — vLLM, TensorRT-LLM, SGLang — because adoption there, not the paper's own numbers, is what tells you it works.
Pushing the acceptance rate up. Since the whole payoff scales with how often guesses are accepted, the frontier is making drafts smarter: proposing trees of candidate continuations instead of a single line, training the drafter to mimic the target model more faithfully, adapting how far ahead it guesses based on confidence. Watch the average-accepted-tokens-per-step figure — that's the honest unit of progress, more than any "x faster" headline.
Speculative decoding meeting long-reasoning models. The reasoning-model token explosion is the biggest tailwind this technique has, and also its most interesting test. Long chains of "thinking" tokens are often more predictable than final prose, which could push acceptance rates — and real-world speedups — higher than the classic benchmarks suggest. Watch whether the labs shipping reasoning models lean on draft-and-verify as a default, because if generation speed is the tax on thinking out loud, this is the most promising way to cut it without dumbing the model down.
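
The accepted-tokens-per-step figure from the second item is simple to compute if you have per-round counts. A minimal sketch, assuming a hypothetical log where each verification pass records how many tokens were drafted and how many of those were accepted. `StepLog` and `summarize` are illustrative names, not any serving framework's API, and the sample numbers are invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepLog:
    drafted: int   # tokens the drafter proposed in this round
    accepted: int  # how many of those the target model kept


def summarize(steps: List[StepLog]) -> Dict[str, float]:
    n = len(steps)
    drafted = sum(s.drafted for s in steps)
    accepted = sum(s.accepted for s in steps)
    return {
        "mean_accepted_per_step": accepted / n,
        # Each pass also emits one token from the target itself
        # (the correction, or a bonus token when everything was accepted).
        "tokens_per_target_pass": (accepted + n) / n,
        "acceptance_rate": accepted / drafted,
    }


# Invented log: four rounds, four drafted tokens each.
log = [StepLog(4, 4), StepLog(4, 1), StepLog(4, 3), StepLog(4, 0)]
print(summarize(log))
```

For this invented log the drafter averages 2 accepted tokens per step, so each big-model pass yields 3 tokens, with half of all drafted tokens kept. Comparing that figure across workloads says more than a single "x faster" number.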

The reassuring story about LLM speed is the one about bigger GPUs: just throw more silicon at it. The quieter, more interesting story is the one tonight — that you can go faster by being clever about what's actually slow. Generation was never bottlenecked on math; it was bottlenecked on memory, on the single-file march of one token at a time. Speculative decoding looks at that line and asks a simple question: what if a cheap guesser ran ahead, and the expensive model only had to nod along? It's a good reminder that some of the best engineering isn't a bigger engine. It's noticing where the real wait is, and routing around it.

That's tonight's Below the Ice. The full episode — same topic, slower and out loud — is up now: listen to today's episode. More deep-dives at penguinalley.com.

Sources: DSpark: Speculative decoding accelerates LLM inference — DeepSeek DeepSpec · Hacker News discussion · Fast Inference from Transformers via Speculative Decoding — Leviathan et al., arXiv · Accelerating LLM Decoding with Speculative Sampling — DeepMind, arXiv · Median internal Codex output growth — Latent Space
