Demystifying evals for AI agents
…frontier models are now nearing saturation at >80%. As evals approach saturation, progress will also slow, as only the most difficult tasks remain. This can make results deceptive, as large capability improvements…
…frontier models are now nearing saturation at >80%. As evals approach saturation, progress will also slow, as only the most difficult tasks remain. This can make results deceptive, as large capability improvements…
…Compounding the problem, labs in China often release dual-use capable models as open-weight. Once a model is open-weight, safeguards that do exist can be removed, making the model available…
…Routing easy/common questions to smaller, cost-efficient models like Claude Haiku 4.5 and hard/unusual questions to more capable models like Claude Sonnet 4.5 to optimize for best performance…
…We’ll be tracking these changes over time as part of our Economic Index as model capabilities, products, and adoption continue to progress. These productivity gains come from making existing tasks faster…
…One way of understanding and tracking the capabilities of AI models is to run an “uplift” study. These experiments randomly divide participants into two groups—one with access to AI and one…
…We are encouraged by this progress, but significant challenges remain. Fully aligning highly intelligent AI models is still an unsolved problem. Model capabilities have not yet reached the point where alignment failures…
…But the rate of progress is remarkable nonetheless. It means that computer use is much more useful for a range of work tasks—and that substantially more capable models are within reach…
…Claude Opus 4.6 is the best model we've tested yet. Its reasoning and planning capabilities have been exceptional at powering our AI Teammates. It's also a fantastic coding model…
…Progress on safeguards and model training has steadily driven down the first; the second—the theoretical blast radius—only grows as capabilities and access expand. Yet as agents become capable of doing…
…The model is excellent at coding, reasoning about biology, and understanding scientific figures. Anthropic's models are unmatched in their reasoning capabilities and safety design. Claude has fundamentally changed what's possible…