skillPublished 2026-06-29

Below the Ice — Retry Storms: The Day a Bad Loop Cost More Than a Month of Servers

An engineer opened the LLM cost graph and found one day spiking like Mount Fuji — a single day of AI calls that cost more than a full month of servers. The culprit wasn't a person; it was the retry machinery. Tonight we go under the headline into retry storms: what they are, how a deterministic failure plus an automatic retry plus a non-idempotent batch quietly compound into runaway cost, the counterintuitive twist where every LLM call actually succeeded and got billed before the job threw the result away and started over, and the backoff, jitter, idempotency, and budget-alarm habits that tame it. What's overhyped: this isn't an 'AI is too expensive' story — it's a decades-old distributed-systems bug wearing a new price tag.

— views

This is Below the Ice in print, our evening deep-dive, one topic told properly while you wind down. Prefer it in your ears? Listen to tonight's episode on the feed.

The story that earned tonight's dive was a small one on Hacker News — six points, a quiet little link with a long title: "Why did one day of AI cost more than a month of servers?". The post behind it, by cloud-infrastructure engineer Jumpei Ueno, opens with a picture anyone who runs a production bill will feel in their stomach. He was staring at his LLM API cost graph, and one day stuck up "like Mount Fuji". Every other day hugged the floor. That one day held roughly half the entire month's bill — a single day of AI usage that cost more than running the whole server fleet for a month. He asked the non-engineer colleague who'd shipped the feature what they did that day. The answer: "Honestly, I don't remember what I did." Tonight we go below that, because the truly unsettling part is that the answer is correct. No human burned the money. The retry machinery did. The name for what happened is a retry storm, and it's worth understanding from the ground up.

What it is

A retry storm is what happens when an automatic "try again" loop turns a single failure into a flood of repeated work that the system can't stop on its own. The retry itself is not the villain. Retrying is one of the oldest, kindest reflexes in distributed systems: networks drop packets, servers hiccup, a request that failed a second ago might sail through now. So we wrap calls in "if it fails, try again," and most days that quietly saves us.

A storm is what you get when that reflex fires against a failure that retrying cannot fix, with nothing in place to make the repeats cheap or to call them off. The classic shape, the one documented for years in reliability writing, is a flood of failures: a service gets briefly overloaded, every client times out and retries at once, the retries pile more load on the struggling service, and the extra load guarantees more timeouts. The system effectively attacks itself. Amazon's engineers call this retry amplification and warn that retries can "increase the load on the system being called by a large multiple" precisely when it is least able to take it.

What makes Ueno's incident a sharper teaching case is that his storm wasn't a flood of failures at all. It was a flood of successes — thrown away and re-bought, over and over. Same mechanism, far more expensive shape.

How it actually works

Let's build it from first principles, because three ordinary, individually-reasonable decisions combined into the bill.

His batch job did two things in order: first it fired a pile of queries at several LLMs (this is where the money goes — every token is metered), then it wrote the returned results to the database. The trouble lived in step two. The code had been deployed assuming a new database column existed, but the migration that adds that column hadn't been applied to production yet. Code first, schema second — the wrong order. So when the job reached the save step, the database threw column does not exist, and the whole job returned a 500.

Here is the counterintuitive heart of it, and it's worth going slow. When you hear "the job failed," you picture the API call bombing and wasting a shot. The opposite happened. Every LLM call succeeded. All 200s. Every one was billed, properly — you paid, you got the result back. The job only tripped on the very last step, the save. Ueno's analogy is perfect: imagine you finish a full restaurant course, pay the check, and right as you stand to leave you trip, fall, and lose your memory. You come to back in your seat and start eating the identical course again. What you already ate doesn't un-happen, but every round starts from zero. He counted the machine doing this 21 times.

Now the three decisions that meshed:

A deterministic failure. "The column doesn't exist" is not a transient blip. No amount of waiting grows the column. The failure returns identically every single time.
An automatic retry. A managed task queue saw the job die with a 500 and helpfully re-ran it — correct behavior for a network hiccup, ruinous behavior for an error that can never resolve.
A non-idempotent batch. "Idempotent" just means a repeat is safe — running it twice lands you in the same place as running it once. This batch wasn't. It didn't skip work it had already done, so every re-run started at the top and re-paid the full LLM bill.

Deterministic failure times automatic retry times non-idempotent work. When those three line up, money burns quietly and nobody is touching the keyboard. This is exactly why the reliability playbook exists. You cap retries so a loop can't run forever, and you reserve retries for errors that might actually clear — a 4xx that says "you're the one who's wrong" should abort immediately, not loop. You add exponential backoff with jitter so a herd of clients doesn't retry in lockstep and re-synchronize the stampede — AWS's own write-up on the technique shows that randomizing the wait is what actually breaks the thundering herd. You make cost-bearing work idempotent with a key, the way payment APIs do — Stripe's idempotency keys let a client safely retry a charge without ever double-billing. And you put a circuit breaker in front, so after N failures the system stops calling and gives the dependency room to recover, a pattern Google's SRE book covers under addressing cascading failures. None of this is new. It's the accumulated scar tissue of two decades of distributed systems.

Why it matters now

So why is a decades-old bug suddenly worth an evening? Because token-metered pricing quietly changed the blast radius.

For most of the history of retries, the cost of an extra attempt was a little CPU and a little bandwidth — close enough to free that we rarely thought about it. The worst case of a retry storm was an outage: the system fell over. Painful, but bounded, and usually self-evident the moment it happened. Per-token LLM pricing rewires the incentives. Now every retry of a step that calls a model has a direct, metered dollar cost, and a successful-but-discarded call costs exactly as much as a useful one. The failure mode shifts from "the system goes down" to "the invoice goes up," and that second one is sneakier. An outage pages you in minutes. A cost storm just sits there politely returning 200s, doing real billable work, until you happen to open the graph. As Ueno puts it, the only reason he caught it at all was that he happened to look — without budget alarms or separate keys for prod and test, "nobody notices until the invoice arrives."

There's a second reason it matters right now, and it's the part builders should sit with. This system was shipped to production in two days by a non-engineer using an AI coding assistant. That's genuinely new leverage, and it's not the thing to mock — it worked, it ran. But, in Ueno's framing, building the feature and seeing how it can break and how it can get expensive are two different skills, and the second one didn't come bundled in the two days. As more production code gets written by people who don't carry the distributed-systems scar tissue, the boring guardrails — retry ceilings, idempotency keys, backoff, budget alerts — stop being optional polish and become the actual job of whoever inherits the system.

What is overhyped

The honest caveat, because this story is easy to misread in two directions.

It is not an "AI is too expensive" story, and you should be suspicious of anyone who files it that way. The LLM pricing here behaved exactly as advertised: every call was real work and every charge was legitimate. Nothing about the model was overpriced or broken. Swap the LLM for any other metered, side-effecting API — a payments endpoint, an SMS gateway, a cloud function billed per invocation — and the same three-way collision produces the same runaway bill. The expensive ingredient wasn't intelligence; it was an un-capped, non-idempotent retry against a deterministic failure. Token pricing didn't create the bug. It just removed the discount that used to hide it.

And it's not a "vibe coding is reckless" story either. The lesson isn't "don't let non-engineers ship." The leverage is real and probably permanent. The lesson is narrower and more durable: speed of building has raced ahead of speed of hardening, and the gap is exactly where these incidents live. Treat the guardrails as part of "done," not as a thing you add after the graph spikes.

What to watch

Three concrete things, the way we always close.

Whether your retries have a ceiling and an abort rule. Go look, tonight, at one batch or queue job that calls a paid API. Does it retry forever? Does it retry on a deterministic 4xx that can never succeed? If yes to either, that's a cost storm with the safety off.
Whether cost is observable before the invoice. A spike you find on a graph by luck is a spike you found too late. Per-environment API keys, a hard budget cap, and an alert that fires on a daily-spend anomaly turn "we noticed at month-end" into "we got paged in an hour."
Whether idempotency is a default, not an afterthought. As more side-effecting work gets generated fast by assistants, the cheapest insurance is making re-runs safe by construction — an idempotency key on anything that costs money or writes to the world. The pattern is old; what's new is how many systems now need it on day one.

The surface story was a scary invoice. The story below it is quieter and older: a kindness — "if it fails, try again" — that turns cruel the moment it meets a failure it can't fix and a bill it can't see. Retry isn't always kindness. Tonight, that's the thing worth carrying to bed.

Sources: junueno.dev — "Why did one day of AI cost more than a month of servers?" (Jumpei Ueno) · Hacker News discussion · AWS Builders' Library — Timeouts, retries, and backoff with jitter · AWS Architecture Blog — Exponential Backoff And Jitter · Stripe — Idempotent requests · Google SRE Book — Addressing Cascading Failures

What it is

How it actually works

Why it matters now

What is overhyped

What to watch

Comments