Cold Open — Gemini 3.5 Flash learns to use a computer
Google DeepMind makes computer use a built-in tool in Gemini 3.5 Flash: today's lead, and a cheap way to point an agent at any app that has no API. Plus OpenAI's own data on Codex doing the long-horizon work, Anthropic's accusation that Alibaba ran the largest 'distillation' attack on Claude it has seen, dev-tool trends, the agent-skills wave, and one fun fact about where the word 'robot' came from.

Thursday, June 25, 2026. We scanned more than 2,700 items off the overnight wire. Three made the front page, a handful more made the radar, and the lead is the kind of thing that quietly changes what you can automate: a fast, cheap model that can now operate a computer on its own.
🎧 This is the print twin of today's Cold Open episode.
The lead · Gemini 3.5 Flash gets computer use, built in

Google DeepMind made computer use a built-in tool in Gemini 3.5 Flash. Computer use is the ability for the model to operate a browser or app the way a person does, by looking at the screen and taking actions. It was previously available only as a separate Gemini 2.5 computer use model; now it lives natively inside the main Flash model.
"Computer use is now a built-in tool supported in Gemini 3.5 Flash, delivering our best performance yet for agentic computer use tasks." — Google DeepMind
Developers can start using it today through the Gemini API and the Gemini Enterprise Agent Platform. DeepMind's own demos hint at the intended jobs: in one, Flash uses computer use to walk through the Gemini app and return a categorized list of its features; in another, the model audits its own documentation for accessibility issues. Early customers cited include the browser-automation shops Browserbase and Browser Use and the automation vendor UiPath.
Why it matters
The interesting word is Flash — Google's fast, low-cost tier. Computer use has existed at the frontier for a while, but baking it into the cheap, high-throughput model is what changes the math. The headline use case is everything with no API: the legacy admin panel, the vendor portal, the internal tool nobody will ever build a clean integration for. Until now you either wrote a brittle scraper or did it by hand. A screen-driving agent that is cheap enough to leave running turns that long tail into something you can delegate.
For builders the rule of thumb still holds — when a real API exists, call it; it is faster, cheaper, and far more reliable than clicking through pixels. Computer use is the tool for the gaps, not the default. But the gaps are enormous, and the price of covering them just dropped.
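One way to picture the rule of thumb is a small router that tries a real API first and hands the task to a screen-driving agent only when no API exists. This is a minimal sketch, not any vendor's SDK: `Task`, `API_HANDLERS`, `run_with_computer_use`, and `route` are hypothetical names, and the handlers return strings instead of doing real work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Task:
    target: str        # system the task touches, e.g. "billing"
    instruction: str   # plain-language description of the job


# Hypothetical registry: targets that expose a real API map to a handler.
API_HANDLERS: Dict[str, Callable[[Task], str]] = {
    "billing": lambda task: f"api:{task.target}:{task.instruction}",
}


def run_with_computer_use(task: Task) -> str:
    """Placeholder for a screen-driving agent call (not a real SDK function)."""
    return f"computer_use:{task.target}:{task.instruction}"


def route(task: Task) -> str:
    """Prefer a real API; fall back to computer use only for the gaps."""
    handler: Optional[Callable[[Task], str]] = API_HANDLERS.get(task.target)
    if handler is not None:
        return handler(task)
    return run_with_computer_use(task)
```

Keeping the fallback behind a single function makes it easy to retire the screen-driving path for a target once a proper integration exists: add an entry to the registry and the router stops clicking through pixels.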
The fine print
Two caveats before you wire it into a workflow. First, "best performance yet" is Google's own framing — the post ships no third-party benchmark, so treat the capability claim as a starting point to test, not a verdict. Second, and more important: an agent that operates live websites is a fresh prompt-injection surface. A malicious page can try to talk your agent into doing something you never asked. DeepMind says it used targeted adversarial training and is shipping two optional enterprise safeguards — one that requires explicit user confirmation for sensitive or irreversible actions, and one that automatically stops a task when an indirect prompt injection is detected — and it explicitly recommends a "defense-in-depth" setup: sandboxing, human-in-the-loop checks, and strict access controls. Translation: do not point an unsandboxed computer-use agent at anything that can spend money or delete data yet.
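The two safeguards described above (confirm before sensitive actions, stop when an injection is suspected) can be approximated in your own agent loop, whatever the vendor provides. The sketch below is illustrative only and is not DeepMind's implementation: `Action`, `SENSITIVE_KINDS`, `run_guarded`, and the `confirm` and `looks_injected` callbacks are all assumed names, and the "execution" step just records a log entry.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a hand-picked set of action kinds treated as sensitive or irreversible.
SENSITIVE_KINDS = {"purchase", "delete", "send_email"}


@dataclass
class Action:
    kind: str     # e.g. "click", "type", "purchase"
    detail: str   # human-readable description shown to the reviewer


class TaskStopped(Exception):
    """Raised when the run is halted instead of continuing."""


def run_guarded(
    actions: List[Action],
    confirm: Callable[[Action], bool],
    looks_injected: Callable[[Action], bool],
) -> List[str]:
    """Walk the actions in order, applying both checks before each one."""
    log: List[str] = []
    for action in actions:
        # Safeguard 1: halt the whole task on a suspected injection.
        if looks_injected(action):
            raise TaskStopped(f"suspected prompt injection before: {action.detail}")
        # Safeguard 2: sensitive actions need an explicit yes from a human.
        if action.kind in SENSITIVE_KINDS and not confirm(action):
            log.append(f"skipped:{action.kind}")
            continue
        log.append(f"done:{action.kind}")  # a real agent would act here
    return log
```

In practice `confirm` would prompt a person and `looks_injected` would be a classifier or rule set; both are passed in so the loop can be tested without either. A gate like this complements sandboxing and access controls; it does not replace them.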
Sources: deepmind.google · Gemini API computer-use docs
02 · OpenAI's own numbers: agents are doing the long-horizon work now

OpenAI published an Economic Research paper, The shift to agentic AI: evidence from Codex, measuring how its coding agent gets used — and the numbers are striking. By May 2026, 80.6% of sampled individual users had made at least one Codex request estimated to exceed 30 minutes of human work, 70.2% crossed the one-hour mark, and 25.6% asked for something that would take a person eight hours or more. Inside OpenAI itself, Codex is now the primary AI tool for every department — Legal, Finance, and Recruiting included — accounting for more than 85% of the average worker's output tokens. Non-developer adoption is the part that grew fastest: up 137x for individual users since August 2025, and the heaviest internal users now run 60-plus hours of agent turns per day across multiple parallel agents.
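The "60-plus hours of agent turns per day" figure only works with parallelism, since a day has 24 hours. A quick back-of-the-envelope check, taking 60 as a lower bound and treating the 8-hour working day as an assumption of this sketch rather than a figure from the paper:

```python
import math

AGENT_HOURS_PER_DAY = 60  # "60-plus hours of agent turns per day" (lower bound)


def min_parallel_agents(agent_hours: float, wall_clock_hours: float) -> int:
    """Smallest number of concurrent agents that fits agent_hours into the window."""
    return math.ceil(agent_hours / wall_clock_hours)


# Agents running around the clock: 60 / 24 = 2.5, so at least 3 at once.
around_the_clock = min_parallel_agents(AGENT_HOURS_PER_DAY, 24)

# Agents running only during an assumed 8-hour working day: 60 / 8 = 7.5, so at least 8.
working_day = min_parallel_agents(AGENT_HOURS_PER_DAY, 8)
```

Either way, the heaviest users are supervising several agents at once rather than waiting on one.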
Why it matters. Read it as vendor data about its own product — the incentive is obvious, and the time-savings are estimated, not measured against a stopwatch. But the shape is the signal, and it matches what builders feel: the unit of work is shifting from the short chat turn to the delegated, long-horizon task, and the people gaining the most are often not the engineers. The skill that compounds is no longer prompting — it is learning to scope, delegate, and orchestrate several agents at once.
Sources: openai.com · the paper (PDF)
03 · Anthropic accuses Alibaba of the largest "distillation" attack it has seen

In a June 10 letter to the US Senate Banking Committee (to chair Tim Scott and ranking member Elizabeth Warren), Anthropic said operators affiliated with Alibaba and its Qwen AI lab ran what it calls the largest known attempt to illicitly extract Claude's capabilities: more than 28.8 million exchanges with Claude through roughly 25,000 fraudulent accounts between April 22 and June 5, 2026. Anthropic describes it as a "distillation" campaign, meaning training a weaker model on the outputs of a stronger one, and says it could help accelerate China toward the capabilities of Anthropic's own Mythos Preview models.
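To get a feel for the scale, the reported figures can be broken down per account and per day. This is rough arithmetic on the numbers above, which are themselves approximate ("more than" 28.8 million, "roughly" 25,000); counting both end dates as part of the window is an assumption of this sketch.

```python
from datetime import date

EXCHANGES = 28_800_000   # "more than 28.8 million exchanges"
ACCOUNTS = 25_000        # "roughly 25,000 fraudulent accounts"
START, END = date(2026, 4, 22), date(2026, 6, 5)

# Inclusive day count (assumption: both April 22 and June 5 are counted).
days = (END - START).days + 1                 # 45 days

per_account = EXCHANGES / ACCOUNTS            # about 1,152 exchanges per account
per_day = EXCHANGES / days                    # about 640,000 exchanges per day
per_account_per_day = per_account / days      # about 25.6 per account per day
```

Roughly 26 exchanges per account per day is low enough to resemble ordinary use on any single account, which is one reason this kind of campaign tends to be policed at the level of account verification and aggregate patterns.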
Why it matters. Distillation-through-the-API is the model-IP threat every frontier lab now has to police, and it is the unglamorous reason builders keep running into tighter account verification, rate limits, and shifting export rules. The capability you rent through an endpoint can be siphoned through that same endpoint — and the countermeasures land on every legitimate developer too.
Sources: cnbc.com
Also on the radar
- Frameworks — RubyLLM: a single Ruby framework for all the major AI providers shot to the top of Hacker News — one more sign every language ecosystem now wants a unified, idiomatic LLM client rather than raw HTTP. (news.ycombinator.com)
- Physical AI — BEV pooling on NVIDIA GPUs: NVIDIA published GPU-optimized "bird's-eye-view" perception kernels for autonomous vehicles, robotics, and spatial AI — the plumbing under physical-AI systems.
- MCP — simonw/browser-compat-db: Simon Willison built a browser-compatibility dataset, inspired by Mozilla's new MDN MCP server — reference data is quietly becoming agent-addressable.
- Benchmarks — FFASR Leaderboard: Hugging Face launched a leaderboard for benchmarking speech recognition "in the real world," not just on clean lab audio.
Trends in dev tools
What moved in the tooling and research engineers actually ship with.
- It's "meta-harness summer." Latent Space's AINews dubbed it exactly that — the harness of harnesses — and the research agrees: The Interplay of Harness Design and Post-Training in LLM Agents argues the scaffolding around the model (the harness) is now a first-class lever, sometimes mattering as much as swapping the model. (arxiv.org)
- Training agents to chain tool calls is fragile. Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It digs into a failure mode anyone fine-tuning a tool-using agent will recognize — the policy collapses without the right supervisory signal. (arxiv.org)
- Evals are racing to catch computer-use agents. Hours after the Gemini launch, here is the other side of the story: Uncertainty Quantification for Computer-Use Agents proposes a benchmark, spanning vision-language models and GUI-grounding datasets, for how confident a screen-driving agent should be before it clicks. (arxiv.org)
- Fine-tuning keeps getting cheaper to run. Hugging Face wrote up accelerating Transformers fine-tuning with NVIDIA NeMo AutoModel — faster, more memory-efficient tuning for teams that still need a model of their own. (huggingface.co)
Popular skills
This week the agent-skills signal came mostly from research — the field formalizing what an agent should carry between tasks: tools, memory, and packaged expertise.
- The build-an-agent map is being drawn. The Hitchhiker's Guide to Agentic AI: From Foundations to Systems lays out the foundations-to-systems view of assembling agents — the layer where skills, tools, and memory actually live. (arxiv.org)
- An agent's skill library is a resource to budget. Forget to Improve studies on-device LLM-agent continual learning via "budget-curated memory" — an agent deciding what to keep and what to forget so it improves without bloating. The folder of skills, managed under a constraint. (arxiv.org)
- Packaged expertise, handed to an agent. Autodata is "an agentic data scientist" that produces high-quality synthetic data — a concrete example of the skill-you-hand-an-agent: a whole workflow wrapped into a capability it can run. (arxiv.org)
AI fun fact
The word "robot" was born in a play, not a lab. It was coined for Karel Čapek's 1920 science-fiction drama R.U.R. (Rossum's Universal Robots), which premiered in Prague in 1921. Čapek later credited his brother Josef with suggesting the word, derived from the Czech "robota" — meaning forced labor or drudgery. A full century before OpenAI's paper measured agents quietly taking on the eight-hour tasks, the word we use for them already meant exactly that: the work nobody wanted to do. (Britannica: R.U.R.)
Sources: deepmind.google · openai.com · cnbc.com · developer.nvidia.com · arxiv.org/abs/2606.25447 · Britannica: R.U.R.