Search: model capability progress

Demystifying evals for AI agents

…frontier models are now nearing saturation at >80%. As evals approach saturation, progress will also slow, as only the most difficult tasks remain. This can make results deceptive, as large capability improvements…

Jan 9, 2026

2028: Two scenarios for global AI leadership

…Compounding the problem, labs in China often release dual-use capable models as open-weight. Once a model is open-weight, safeguards that do exist can be removed, making the model available…

May 14, 2026

Building Effective AI Agents

…Routing easy/common questions to smaller, cost-efficient models like Claude Haiku 4.5 and hard/unusual questions to more capable models like Claude Sonnet 4.5 to optimize for best performance…

Dec 19, 2024

Estimating AI productivity gains

…We’ll be tracking these changes over time as part of our Economic Index as model capabilities, products, and adoption continue to progress. These productivity gains come from making existing tasks faster…

Nov 25, 2025

Project Fetch: Can Claude train a robot dog?

…One way of understanding and tracking the capabilities of AI models is to run an “uplift” study. These experiments randomly divide participants into two groups—one with access to AI and one…

Nov 12, 2025

Teaching Claude why

…We are encouraged by this progress, but significant challenges remain. Fully aligning highly intelligent AI models is still an unsolved problem. Model capabilities have not yet reached the point where alignment failures…

May 8, 2026

Introducing Sonnet 4.6

…But the rate of progress is remarkable nonetheless. It means that computer use is much more useful for a range of work tasks—and that substantially more capable models are within reach…

Feb 17, 2026

Claude Opus 4.6

…Claude Opus 4.6 is the best model we've tested yet. Its reasoning and planning capabilities have been exceptional at powering our AI Teammates. It's also a fantastic coding model…

Feb 5, 2026

How we contain Claude across products

…Progress on safeguards and model training has steadily driven down the first; the second—the theoretical blast radius—only grows as capabilities and access expand. Yet as agents become capable of doing…

May 25, 2026

Advancing Claude in healthcare and the life sciences

…The model is excellent at coding, reasoning about biology, and understanding scientific figures. Anthropic's models are unmatched in their reasoning capabilities and safety design. Claude has fundamentally changed what's possible…

Jan 11, 2026
