Search: online strategy & metrics

Demystifying evals for AI agents

… As shown in the illustrative YAML file below, one could evaluate this agent using both graders and metrics. task: id: "fix-auth-bypass 1" desc: "Fix authentication bypass when password field is empty and ..." graders: - type: deterministic tests required: test empty pw rejected.py, test null pw rej… …

Jan 9, 2026

Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench

… The other strategy is something we human scientists could learn from: when Claude is uncertain about an answer, it layers multiple methods and combines different lines of evidence to arrive at a conclusion. …

Apr 29, 2026

Measuring AI agent autonomy in practice

… However, the two metrics are not directly comparable. …

Feb 18, 2026

Project Vend: Phase two

… It could now use a web browser to check prices and delivery information on websites by itself, and could do deeper research online to find and compare suppliers (we still didn’t give it access to a payment interface, to ensure it always checked with a human before making purchases). …

Dec 18, 2025

Anthropic Economic Index report: Cadences

… This pattern is consistent with the possibility that AI substitutes for a larger share of the tasks that workers in lower-income countries do day-to-day, even if occupation-level exposure metrics—which tend to be higher in advanced economies—suggest otherwise. …

Jun 26, 2026

How AI Is Transforming Work at Anthropic

… The year-on-year comparison is quite dramatic—this suggests a more than 2x increase in both metrics in one year. …

Dec 2, 2025
