Below the Ice — Eighteen Answers Hiding in Unsolved Cases
This morning's headline ran three lines: an AI reasoning model helped doctors solve 18 rare-disease cases that had gone unanswered for years. Tonight we go underneath: what actually happened in the study, how a model reasons about a child's genome, why it matters for the families still waiting, and the honest caveat that a new benchmark makes hard to ignore.

This is the print twin of tonight's Below the Ice — our evening deep-dive, one topic told properly. Prefer it in your ears while you wind down? Listen to today's episode.
This morning the radar gave it three lines, filed under Health: an OpenAI reasoning model helped clinicians diagnose rare genetic diseases in children, surfacing 18 new diagnoses in previously unsolved cases. It deserved more than a bullet point. So tonight we go below it — slowly, from first principles, because the thing under this headline is both more hopeful and more fragile than a one-liner can hold.
What it is
Here is the plain version. Researchers from Boston Children's Hospital's Manton Center for Orphan Disease Research, Harvard University, and OpenAI took 376 previously analyzed rare-disease cases that remained unsolved — children whose families had already been through extensive genetic testing and specialist review and still had no answer — and ran their de-identified clinical and genomic data back through an AI reasoning model, OpenAI's o3 Deep Research. The model surfaced evidence-linked candidate explanations. Specialists reviewed those leads, ordered more testing, and confirmed them the ordinary way. The result, published in NEJM AI: physicians established diagnoses in 18 cases — an additional diagnostic yield of 4.8% on top of everything the human experts had already found.
Read that number twice, because it points both ways. Eighteen families who had been waiting, sometimes for years, now have a name for what their child has. And the model did not invent anything exotic: it found answers that were already sitting in the data, waiting for someone, or something, with the patience to look again.
The backdrop is the part most people outside medicine don't know: even with full genomic sequencing, roughly half of people with rare diseases never get a clear genetic diagnosis. The clues are often there. Finding them can mean sifting through thousands to millions of possible genetic variants, fragmented records spread across years, and a scientific literature that changes faster than any one clinician can read it. That last bit matters: a case that was genuinely unsolvable in 2023 can quietly become solvable in 2026, the day a new gene-disease link gets published. Nobody goes back to re-open it. That re-opening is what this study automated.
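For readers who think in code, the re-opening trigger described above can be sketched in a few lines. This is a toy illustration, not the study's pipeline: the `UnsolvedCase` and `GeneDiseaseLink` types, the `cases_to_reopen` function, and every gene and disorder name are hypothetical placeholders. It only shows the shape of the idea: flag a closed case when a gene-disease link touching one of its candidate genes was published after the case was last reviewed.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UnsolvedCase:
    case_id: str
    last_reviewed: date
    candidate_genes: frozenset[str]  # genes carrying variants of unknown significance


@dataclass(frozen=True)
class GeneDiseaseLink:
    gene: str
    disease: str
    published: date


def cases_to_reopen(cases, links):
    """Map case_id -> links published after that case's last review
    that involve one of the case's candidate genes."""
    flagged = {}
    for case in cases:
        hits = [
            link
            for link in links
            if link.gene in case.candidate_genes
            and link.published > case.last_reviewed
        ]
        if hits:
            flagged[case.case_id] = hits
    return flagged


# Placeholder data: gene and disorder names are invented for illustration.
cases = [
    UnsolvedCase("case-001", date(2023, 5, 1), frozenset({"GENE_A", "GENE_B"})),
    UnsolvedCase("case-002", date(2023, 9, 15), frozenset({"GENE_C"})),
]
links = [
    GeneDiseaseLink("GENE_B", "hypothetical disorder X", date(2026, 1, 10)),
    GeneDiseaseLink("GENE_C", "hypothetical disorder Y", date(2022, 3, 2)),
]

reopen = cases_to_reopen(cases, links)
print(sorted(reopen))  # ['case-001']
```

Only the first case is flagged, because the link on its candidate gene is newer than its last review. A real system would also have to weigh phenotype fit and variant-level evidence, which is the part the reasoning model and the specialists handled in the study.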
How it actually works
Strip away the word "AI" and what o3 Deep Research is doing here has a precise name: abduction — inference to the best explanation. Not deduction (facts forcing a conclusion), not pure pattern-matching (this looks like that). Abduction is what a good diagnostician does: gather a messy, incomplete set of observations — a child's symptoms, family history, inheritance pattern, a list of genetic variants of unknown significance — and reason backward to the hypothesis that would best account for all of it.
An analogy. Picture a tireless medical resident who has read every case report published this year, can hold a patient's entire chart and genome in working memory at once without dropping a detail, and is willing — at 3 a.m., for the four-hundredth file — to ask "what if we re-read this one against what we learned last month?" That resident doesn't know the answer and isn't allowed to declare one. What they produce is a short list of the most plausible explanations, each with its evidence trail attached, handed to an attending physician who decides what's worth chasing. That's the shape of what happened here: the model widened the search and focused the follow-up; humans made every call.
The crucial property of this kind of reasoning is that it's defeasible — every conclusion is provisional, held only until a better fact arrives. The best explanation today gets overturned tomorrow when a confirmatory test comes back negative, and a good reasoner has to revise gracefully rather than dig in. Hold onto that word, defeasible. It's the hinge the whole story turns on, and it's where the optimism meets its limit.
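A toy sketch makes "defeasible" concrete. It assumes the simplest possible scoring (count the observations a hypothesis accounts for, then penalize predicted features that were not observed), which is far cruder than anything a clinician or o3 Deep Research does. The function `best_explanation` and the condition names are hypothetical.

```python
def best_explanation(observations, hypotheses, ruled_out=frozenset()):
    """Return the hypothesis that accounts for the most observations,
    skipping any that later evidence has ruled out. None if nothing is left."""
    candidates = {
        name: predicted
        for name, predicted in hypotheses.items()
        if name not in ruled_out
    }
    if not candidates:
        return None

    def score(name):
        predicted = candidates[name]
        explained = len(observations & predicted)
        unobserved_predictions = len(predicted - observations)
        # More explained observations is better; fewer unobserved predictions breaks ties.
        return (explained, -unobserved_predictions)

    return max(candidates, key=score)


# Illustrative inputs only; the conditions are placeholders, not real diagnoses.
observations = {"seizures", "developmental delay", "low muscle tone"}
hypotheses = {
    "condition_A": {"seizures", "developmental delay", "low muscle tone"},
    "condition_B": {"seizures", "hearing loss"},
}

first_guess = best_explanation(observations, hypotheses)
print(first_guess)  # condition_A

# A confirmatory test for condition_A comes back negative: revise, don't dig in.
revised = best_explanation(observations, hypotheses, ruled_out={"condition_A"})
print(revised)  # condition_B
```

The conclusion changes the moment a new fact arrives, and nothing in the first answer is treated as settled. That revision step is the behavior the rest of this piece keeps returning to.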
Why it matters now
For the families, the stakes are obvious and human: an answer ends a diagnostic odyssey, sometimes unlocks a treatment, and at minimum replaces "we don't know" with a name and a community. A 4.8% bump sounds modest until you remember it's 4.8% of the cases the best specialists had already given up on.
For builders, the interesting signal is the pattern, and it generalizes well beyond medicine. This is a clean example of AI as a reanalysis engine over a backlog: pointing a patient, literature-aware reasoner at a pile of cases that were closed not because they were unsolvable but because re-opening each one by hand was never worth a human's time. Almost every domain has that pile: support tickets marked "could not reproduce," security findings filed as inconclusive, experiments shelved when the method wasn't ready. What changed isn't accuracy but economics: the cost of a careful second look dropped close to zero, so "look again, now that we know more" becomes something you can actually afford to do at scale.
And notice the architecture the researchers were careful to keep: the model proposes, the human disposes. The OpenAI writeup is blunt that the model "did not diagnose any participant; physicians and other qualified clinical experts made every diagnosis." That isn't a disclaimer bolted on for lawyers. As we're about to see, it's load-bearing.
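As a sketch of what "the model proposes, the human disposes" could look like in software, here is a minimal gate in which nothing counts as a diagnosis until a named reviewer has confirmed it. The `Lead` type, the `adjudicate` and `diagnoses` functions, and the sample data are all hypothetical. The study's actual workflow ran through specialist review and follow-up testing, not through code like this.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PROPOSED = "proposed"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass
class Lead:
    case_id: str
    hypothesis: str
    evidence: List[str]  # the evidence trail the model attaches to its suggestion
    status: Status = Status.PROPOSED
    reviewed_by: Optional[str] = None


def adjudicate(lead: Lead, clinician: str, confirmed: bool) -> Lead:
    """Record a human decision on a model-generated lead."""
    if not clinician:
        raise ValueError("a named clinician must adjudicate every lead")
    lead.status = Status.CONFIRMED if confirmed else Status.REJECTED
    lead.reviewed_by = clinician
    return lead


def diagnoses(leads: List[Lead]) -> List[Lead]:
    """Only human-confirmed leads are ever reported as diagnoses."""
    return [l for l in leads if l.status is Status.CONFIRMED and l.reviewed_by]


# Placeholder leads; names are invented for illustration.
leads = [
    Lead("case-001", "hypothetical disorder X", ["variant in GENE_B", "matching symptoms"]),
    Lead("case-002", "hypothetical disorder Y", ["variant in GENE_C"]),
]
print(len(diagnoses(leads)))  # 0: proposals alone are never diagnoses

adjudicate(leads[0], "Dr. Example", confirmed=True)
adjudicate(leads[1], "Dr. Example", confirmed=False)
print([l.case_id for l in diagnoses(leads)])  # ['case-001']
```

The design choice worth copying is that the model's output type has no path to "confirmed" except through `adjudicate`.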
What is overhyped
Here's the honest part, the reason we don't end on the inspiring number.
First, the study's own fine print. It was retrospective: it re-examined existing case files rather than running live in a clinic. The cohorts were heterogeneous, the reviewers were not blinded to how confident the model was, and the team explicitly did not measure the things that decide whether this is practical: time saved, cost, clinician effort, or false-positive burden (how many wrong leads a specialist had to chase for every good one). The authors say it plainly: this is "not evidence that patients, clinicians, or customers should use OpenAI models to diagnose disease." It's a research result about widening a search, not a product you point at your kid's symptoms.
Second, and deeper, is exactly the property we flagged: defeasible reasoning is where these models are most fragile. A paper that landed the same week makes this uncomfortably concrete. DeFAb, a benchmark for "defeasible abduction" (the precise reasoning move diagnosis requires), pits frontier models against a rule-based logic solver. The solver resolves every instance in under 50 microseconds at 100% accuracy. The best frontier language model tops out at 65% and drops to 23.5% under "rendering-robust" evaluation, meaning the underlying problem stays identical and only its surface wording changes. A model that looks like a brilliant diagnostician on one phrasing can collapse to coin-flip-or-worse on a reworded version of the same case. Fluent, confident, and wrong is the failure mode, and it doesn't announce itself.
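To see why a robustness-style score can sit so far below a headline accuracy, here is a sketch under one loudly stated assumption: an instance counts as solved only if the model answers every rewording of it correctly. That is our reading of "rendering-robust," not DeFAb's published definition, and the `accuracies` function and the sample results are invented for illustration.

```python
from collections import defaultdict


def accuracies(results):
    """results: iterable of (instance_id, rendering_id, is_correct).

    Returns (per_rendering_accuracy, robust_accuracy), where an instance is
    'robustly' solved only if every one of its renderings was answered correctly.
    """
    by_instance = defaultdict(list)
    for instance_id, _rendering_id, is_correct in results:
        by_instance[instance_id].append(bool(is_correct))

    total_answers = sum(len(outcomes) for outcomes in by_instance.values())
    if total_answers == 0:
        raise ValueError("no results to score")

    per_rendering = sum(sum(outcomes) for outcomes in by_instance.values()) / total_answers
    robust = sum(all(outcomes) for outcomes in by_instance.values()) / len(by_instance)
    return per_rendering, robust


# Invented results: four problems, each phrased two ways.
results = [
    ("p1", "wording_a", True), ("p1", "wording_b", True),
    ("p2", "wording_a", True), ("p2", "wording_b", False),
    ("p3", "wording_a", True), ("p3", "wording_b", False),
    ("p4", "wording_a", False), ("p4", "wording_b", False),
]

per_rendering, robust = accuracies(results)
print(per_rendering, robust)  # 0.5 0.25
```

In this made-up run the model gets half of all answers right but solves only a quarter of the problems regardless of wording. The gap between those two numbers is the brittleness the benchmark is built to expose.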
That's why "every result passed through human adjudication and clinical confirmation" isn't a courtesy line. It's the only thing standing between a plausible-sounding explanation and a real one. The hype reads this study as "AI is diagnosing rare diseases now." The honest read is "AI got remarkably good at generating leads a human expert still has to verify — and we have fresh evidence that its reasoning is brittle in exactly the way that makes verification non-negotiable."
What to watch
Three concrete things, the way we close every dive.
- Prospective trials, with the boring metrics. The next stage has to run forward, in real clinics, comparing AI-assisted reanalysis against standard practice — and actually measuring time-to-candidate, false-positive workload, clinician effort, and cost. Until those numbers exist, the yield is promising, not proven. Watch for multi-center studies that report them.
- Reanalysis becoming routine, not heroic. OpenAI says the Manton Center will lead the next stage through a grant from the OpenAI Foundation, aimed at a platform-agnostic, low-cost genetics AI copilot. The thing to watch is whether re-opening cold cases turns from a one-off study into a standing, repeatable step in care. That is the difference between a headline and infrastructure.
- Specialized models versus robustness gates. o3 Deep Research is general-purpose; purpose-built life-science systems like GPT-Rosalind are designed for deeper work on how variants affect protein structure. Watch whether specialist models push the yield past 4.8% — and, just as much, whether robustness benchmarks like DeFAb start deciding which models get trusted anywhere near a clinic in the first place.
The promise here is not that AI replaces a doctor. It's smaller and, honestly, better than that: carefully evaluated tools that help specialists notice the evidence worth investigating, so that for thousands of families, today's unanswered question doesn't have to stay unanswered forever. Keep the optimism. Keep the human in the loop. Both, at once.
That's tonight's Below the Ice. The full episode — same topic, slower and out loud — is up now: listen to today's episode. More deep-dives at penguinalley.com.
Sources: OpenAI — Using AI to help diagnose rare childhood diseases · NEJM AI study · DeFAb: A Verifiable Benchmark for Defeasible Abduction (arXiv)