Search: Benchmarks and reliability

Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench

… To answer this, the research community has built several benchmarks. …

Apr 29, 2026

Eval awareness in Claude Opus 4.6’s BrowseComp performance

… This finding raises questions about whether static benchmarks remain reliable when run in web-enabled environments. …

Mar 6, 2026

Introducing Claude Opus 4.5

… For production code review at scale, that reliability matters. Based on testing with Junie, our coding agent, Claude Opus 4.5 outperforms Sonnet 4.5 across all benchmarks. …

Nov 24, 2025

Introducing Claude Opus 4.7

… This is the reliability jump that makes Notion Agent feel like a true teammate. …

Apr 16, 2026

Quantifying infrastructure noise in agentic coding evals

… Static benchmarks score a model's output directly—the runtime environment doesn’t factor into the result. …

Feb 5, 2026

…This creates a fundamentally more reliable way to analyze financial data—information is verified across sources to reduce errors, every claim links directly to its original source for transparency, and complex analysis…

Jul 15, 2025

Introducing Sonnet 4.6

… Evaluating Claude Sonnet 4.6: Beyond computer use, Claude Sonnet 4.6 has improved on benchmarks across the board. …

Feb 17, 2026

Anthropic forms $200 million partnership with the Gates Foundation

… This includes creating public goods—like model benchmarks, datasets, and knowledge graphs—to ensure AI tools for math tutoring, college advising, and curriculum design are effective. …

May 14, 2026