Demystifying evals for AI agents
…it processes inputs, orchestrates tool calls, and returns results. When we evaluate “an agent,” we’re evaluating the harness and the model working together. For example, Claude Code is a flexible agent…
Societal Impacts: How people ask Claude for personal guidance (Apr 30, 2026). People don’t just come to Claude for code reviews or meeting summaries. They ask whether to take the job…
…Likewise, the specific platform used—Claude Code, an API, or a chat interface—also did not correlate with an actor’s risk level. What often helps distinguish higher-risk actors is where…
…Claude usage remains concentrated among certain tasks, most of them related to coding While we see over 3,000 unique work tasks in Claude.ai, the top 10 most common tasks account…
…This is a very challenging task for Claude, given that Claude receives only the title and description of the JIRA tickets, while the human developers have full context on the codebase and…
…In this study we train Claude to translate its thoughts into human-readable text. Donating our open-source alignment tool Focus areas for The Anthropic Institute At The Anthropic Institute (TAI), we…
…In software engineering, whenever a program is updated, developers face this exact problem of identifying a small, critical change within a vast sea of code. This is why “diff” tools were invented…
…In line with other data showing that Claude is extensively used for coding, Computer Programmers are at the top, with 75% coverage, followed by Customer Service Representatives, whose main tasks we increasingly…
…An agent that writes lean, efficient code very fast will do well under tight constraints. An agent that brute-forces solutions with heavyweight tools will do well under generous ones. Both are…