Search: Agent behavior comparisons

Donating our open-source alignment tool

… A further “judge” model then scores the resulting transcripts for misaligned behaviors. …

May 7, 2026

Demystifying evals for AI agents

… For example, Claude Code is a flexible agent harness, and we used its core primitives through the Agent SDK to build our long-running agent harness. An evaluation suite is a collection of tasks designed to measure specific capabilities or behaviors. …

Jan 9, 2026

Harness design for long-running application development

… Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK's automatic compaction handling context growth along the way. …

Mar 24, 2026
