Claude does cyber competitions
…naive evaluation of LLMs can underestimate their capabilities. Like people, AI models are more effective at realistic tasks when given the right tools. In this case, open source tools used by humans…
When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually
Demystifying evals for AI agents
Claude Sonnet 4.5 represents a meaningful improvement, but we know that many of its capabilities are nascent and do not yet match those of security professionals and established processes. We will keep working to improve the defense-relevant capabilities of our models and enhance the threat intelligence and mitigations that safeguard our platforms. In fact, we have already been using results of our investigations and evaluations to continually refine our ability to catch misuse of our models for harmful cyber behavior. This includes using techniques like organization-level summarization to und
Building AI for cyber defenders
…As our models have improved, they have become more aligned on most behavior evaluations, but this doesn’t mean risk necessarily shrinks. Less capable models are more likely to misread a situation…
…complement Model Context Protocol (MCP) servers by teaching agents more complex workflows that involve external tools and software. Looking further ahead, we hope to enable agents to create, edit, and evaluate Skills…
…This tallies with external testers’ experience of Mythos Preview’s performance, and with recent additional evaluations of the model: The UK’s AI Security Institute reports that Mythos Preview is the first…
…In this post, we evaluate how much large language models can accelerate and automate the process of developing N-day exploits.
…By default, the model correctly states that it doesn’t detect any injected concept. However, when we inject the “all caps” vector into the model’s activations, the model notices the presence…
…by evaluations of Claude’s agentic performance on detailed simulations of medical and scientific tasks, since this correlates most closely to real-world usefulness. Here, Claude Opus 4.5, our latest model…
…Our new Anthropic Fellows research project extends model diffing to its most challenging and general use case: comparing models with entirely different architectures. By building a generic diff tool for AI models…
…But the economic utility of models is constrained by their ability to perform work continuously for days or weeks without needing human intervention. The need to evaluate this capability led Andon Labs…