Search

Showing top 61 results for "real-world evaluation"

People also ask

Why build evaluations?

When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually

Demystifying evals for AI agents

Is an LLM’s knowledge useful in an applied scenario?

In considering the contribution of AI to biorisk, we need to know more than just how well it performs on a quiz. We need to look at evaluations that involve real people, and closely mirror our actual threat scenarios. Moreover, just as we benchmark AI knowledge by comparing it to experts, we need to measure AI utility by comparing it to the most easily accessible alternative—in this case, the internet. To meet both of these criteria, we have conducted several controlled trials measuring AI’s ability to assist in the planning of a hypothetical bioweapons acquisition process. Participants were g

LLMs and biorisk

People also ask

How we contain Claude across products

Mapping AI-enabled cyber threats: Insights from the LLM ATT&CK Navigator

Focus areas for The Anthropic Institute

Trustworthy agents in practice

Introducing Claude Corps

2028: Two scenarios for global AI leadership

Project Vend: Phase two

The assistant axis: situating and stabilizing the character of large language models

Vibe physics: The AI grad student

Introducing advanced tool use on the Claude Developer Platform