How we contain Claude across products
…As our models have improved, they have become more aligned on most behavior evaluations, but this doesn’t mean risk necessarily shrinks. Less capable models are more likely to misread a situation…
When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually
Demystifying evals for AI agents
In considering the contribution of AI to biorisk, we need to know more than just how well it performs on a quiz. We need to look at evaluations that involve real people, and closely mirror our actual threat scenarios. Moreover, just as we benchmark AI knowledge by comparing it to experts, we need to measure AI utility by comparing it to the most easily accessible alternative, in this case the internet. To meet both of these criteria, we have conducted several controlled trials measuring AI’s ability to assist in the planning of a hypothetical bioweapons acquisition process. Participants were g
LLMs and biorisk
…large fraction of real-world harm comes from N-days: vulnerabilities that have already been publicly disclosed, but only patched on some devices. In this post, we evaluate how much large language…
…We’re researching how these dynamics might shape the outside world, and how the public can help direct those changes. At TAI, we’ll study AI's real-world impacts from our…
…We train the model to recognize injection patterns, monitor production traffic to block real-world attacks, and have external red-teamers battle test our systems. Even together, these safeguards are not a…
…Goodwill Industries International is participating in Claude Corps to help us bridge the gap between AI's potential and its responsible, real-world application. We look forward to learning from peers, sharing…
…PRC labs have real strengths: world-class, innovative talent, abundant and cheap energy, and plenty of data. All are requirements for developing frontier intelligence. But they simply do not have sufficient domestic…
…It’s very hard to forecast exactly how things will go for AI agents in the real world; simulations (like Andon Labs’ Vending-Bench evaluation) only get you so far. That’s…
…Naturalistic case studies. To understand whether this finding is likely to replicate in the real world, we simulated longer conversations that real users might naturally have with AI models, and tested whether…
…The AI grad student Mar 23, 2026 Can AI do theoretical physics? In this guest post, professor of physics Matthew Schwartz decided to find out by supervising Claude through a real research…
…understands Validation concerns better handled by JSON Schema constraints Best practices Building agents that take real-world actions means handling scale, complexity, and precision simultaneously. These three features work together to solve…