Search

Showing top 17 results for "Benchmarks and reliability"

Natural Language Autoencoders

…We’ve already applied NLAs to understand what Claude is thinking and to improve Claude’s safety and reliability. For instance: When Claude Opus 4.6 and Mythos Preview were undergoing safety…

May 7, 2026

Claude Opus 4.6

…It plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes. And, in a…

Feb 5, 2026

Demystifying evals for AI agents

…They evolved from manual grading to LLM graders with criteria defined by the product team and periodic human calibration, and now regularly run two separate suites for quality benchmarking and regression testing…

Jan 9, 2026

Anthropic Economic Index report: Economic primitives

…That these relationships are intuitive and consistent suggests the primitives capture relevant aspects of how people and businesses use Claude. External benchmarks reinforce this. In our productivity work, Claude’s time estimates…

Jan 15, 2026

Partnering with Mozilla to improve Firefox’s security

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Mar 6, 2026

Automated Alignment Researchers: Using large language models to scale scalable oversight

…We do this because we need a way to automatically and reliably evaluate whether the AAR has made progress. However, if AARs discovered much better weak-to-strong supervision methods that generalized…

Apr 14, 2026

Building Effective AI Agents

…Agents are emerging in production as LLMs mature in key capabilities—understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with…

Dec 19, 2024
