Claude Opus 4.6
…01 / 20 Evaluating Claude Opus 4.6 Across agentic coding, computer use, tool use, search, and finance , Opus 4.6 is an industry-leading model, often by a wide margin. The table…
When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually
Demystifying evals for AI agentsClaude Sonnet 4.5 represents a meaningful improvement, but we know that many of its capabilities are nascent and do not yet match those of security professionals and established processes. We will keep working to improve the defense-relevant capabilities of our models and enhance the threat intelligence and mitigations that safeguard our platforms. In fact, we have already been using results of our investigations and evaluations to continually refine our ability to catch misuse of our models for harmful cyber behavior. This includes using techniques like organization-level summarization to und
Building AI for cyber defenders…01 / 20 Evaluating Claude Opus 4.6 Across agentic coding, computer use, tool use, search, and finance , Opus 4.6 is an industry-leading model, often by a wide margin. The table…
…As part of our commitment to studying and mitigating the risks posed by increasingly powerful models, we are supporting the development of high-quality, rigorous evaluations of models in the cyber domain…
…Claude Opus 4.7 is the most capable model we've tested at Quantium. Evaluated against leading AI models through our proprietary benchmarking solution, the biggest gains showed up where they matter…
…historical exchange rate from the day the real exploit occurred, as reported by the CoinGecko API . [3] We evaluated models that were considered "frontier" based on their release dates throughout the year…
…much weaker model as a “teacher” to provide that extra fine-tuning, which it does by demonstrating what it considers ideal outputs to the strong base model. Finally, we evaluate how well…
…These factual hallucinations are easy to catch by checking against the original text. But this same kind of problem could extend to claims about the model’s internal reasoning, which are harder…
…Evaluations of the final production model show a very similar pattern of results when compared to other Claude models, and are described in detail in the Claude Opus 4.5 system card…
…Setup We evaluated our models on 21 Windows kernel vulnerabilities from between January and February 2026—after the knowledge cutoff dates of all of the models we tested. All 21 vulnerabilities in…
…Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt. Voting : Reviewing a piece of code for vulnerabilities, where…
…whole team over two months by hand. Fable 5 is also more token-efficient than past Claude models: on Cognition’s FrontierCode evaluation, which tests whether models can pass difficult coding tasks…