Showing top 66 results for "model-by-model evaluation"

People also ask

Why build evaluations?

When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually

Demystifying evals for AI agents

What's next?

Claude Sonnet 4.5 represents a meaningful improvement, but we know that many of its capabilities are nascent and do not yet match those of security professionals and established processes. We will keep working to improve the defense-relevant capabilities of our models and enhance the threat intelligence and mitigations that safeguard our platforms. In fact, we have already been using results of our investigations and evaluations to continually refine our ability to catch misuse of our models for harmful cyber behavior. This includes using techniques like organization-level summarization to und

Building AI for cyber defenders

People also ask

Reverse engineering Claude's CVE-2026-2796 exploit

Paving the way for agents in biology

Introducing the Services Track and Partner Hub of the Claude Partner Network

Assessing Claude Mythos Preview’s cybersecurity capabilities

Australian government and Anthropic sign MOU for AI safety and research

Trustworthy agents in practice

Mapping AI-enabled cyber threats: Insights from the LLM ATT&CK Navigator

2028: Two scenarios for global AI leadership

Agentic coding and persistent returns to expertise

The assistant axis: situating and stabilizing the character of large language models