Demystifying evals for AI agents
…For instance, Opus 4.5 solved a 𝜏2-bench problem about booking a flight by discovering a loophole in the policy. It “failed” the evaluation as written, but actually came up with…
…Anthropic has faced mounting criticism in recent months over aggressive Claude rate limits, especially surrounding Claude Code and Opus usage.
…For a baseline reference point, we use Anthropic’s Claude Opus 4.6 average output throughput via OpenRouter.ai, one of the most popular routing services for production API access. That baseline…
Anthropic launches Claude Design following Opus 4.7 model upgrade Zac Hall | Apr 17 2026 - 8:20 am PT After two previous design-related updates…
Product Announcements Introducing Claude Opus 4.8 May 28, 2026 We’re upgrading Claude Opus to a new version: Claude Opus 4.8. It builds on Opus 4.7 with improvements across…
Announcements Introducing Claude Opus 4.6 Feb 5, 2026 We’re upgrading our smartest model. The new Claude Opus 4.6 improves on its predecessor’s coding skills. It plans more carefully…
…Opus 4.7 is Anthropic's most capable model yet Classic marketing speak, or is it? On the 16th of April, Anthropic announced that Claude Opus…
Product Introducing Claude Sonnet 4.6 Feb 17, 2026 Claude Sonnet 4.6 is our most capable Sonnet model yet. It’s a full upgrade of the model’s skills across coding…