Donating our open-source alignment tool
… A further “judge” model then scores the resulting transcripts for misaligned behaviors. …
… A further “judge” model then scores the resulting transcripts for misaligned behaviors. …
… For example, Claude Code is a flexible agent harness, and we used its core primitives through the Agent SDK to build our long-running agent harness . An evaluation suite is a collection of tasks designed to measure specific capabilities or behaviors. …
… Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK 's automatic compaction handling context growth along the way. …
… For teams running agentic coding at scale, we’re seeing strong resolution rates and the kind of consistency developers need. …
… Because we characterize agents as AI systems that use tools, we can analyze individual tool calls as the building blocks of agent behavior. …
… The robodogs came with some pre-programmed behaviors which our participants managed to unlock. …
… Claude Code’s agentic architecture splits coding work into smaller API calls, which are labeled as distinct tasks. …