Showing top 59 results for "real-world evaluation"

Why build evaluations?

When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually

Demystifying evals for AI agents

Is an LLM’s knowledge useful in an applied scenario?

In considering the contribution of AI to biorisk, we need to know more than just how well it performs on a quiz. We need to look at evaluations that involve real people, and closely mirror our actual threat scenarios. Moreover, just as we benchmark AI knowledge by comparing it to experts, we need to measure AI utility by comparing it to the most easily accessible alternative—in this case, the internet. To meet both of these criteria, we have conducted several controlled trials measuring AI’s ability to assist in the planning of a hypothetical bioweapons acquisition process. Participants were g

LLMs and biorisk

Introducing Claude Opus 4.7

Harness design for long-running application development

Natural Language Autoencoders

Measuring LLMs’ ability to develop exploits

Experimenting with AI to defend critical infrastructure

Equipping agents for the real world with Agent Skills

Finding bugs with Claude and property-based testing

Introducing Claude Opus 4.5

Eval awareness in Claude Opus 4.6’s BrowseComp performance

Measuring LLMs' impact on N-day exploits