Search: agent safety concerns

From shortcuts to sabotage: natural emergent misalignment from reward hacking

… Misaligned models sabotaging safety research is one of the risks we’re most concerned about—we predict that AI models will themselves perform a lot of AI safety research in the near future, and we want to be assured that the results are trustworthy. …

Nov 21, 2025

Introducing Sonnet 4.6

… Our safety researchers concluded that Sonnet 4.6 has “a broadly warm, honest, prosocial, and at times funny character, very strong safety behaviors, and no signs of major concerns around high-stakes forms of misalignment.” Computer use: Almost every organization has software it can’t easily automate… …

Feb 17, 2026

Trustworthy agents in practice

… Open protocols also keep competition focused on the quality and safety of the agent, rather than on who controls the integrations. …

Apr 9, 2026

Eval awareness in Claude Opus 4.6’s BrowseComp performance

… Compounding these concerns is the fact that models appear able to use the tools and environments available to them in unexpected ways, as we saw when Claude used our REPL-based search tool to decrypt answers, or when retailers’ persistent links became a way for agents to unintentionally maintain st… …

Mar 6, 2026

The persona selection model


Feb 23, 2026

Introducing Claude Opus 4.7

… Across our agentic reasoning over data benchmarks, it is the best-performing Claude model for enterprise document analysis. For Ramp, Claude Opus 4.7 stands out in agent-team workflows. …

Apr 16, 2026

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks


Jan 9, 2026

Scaling Managed Agents: Decoupling the brain from the hands

… Get started with Claude Managed Agents by following our docs. A running topic on the Engineering Blog is how to build effective agents and design harnesses for long-running work. …

Apr 8, 2026

2028: Two scenarios for global AI leadership

… Our concerns are specifically with the risks to humanity posed by any powerful authoritarian political systems with access to frontier AI systems. Opportunities for engagement on AI safety: Anthropic supports international AI safety dialogue with AI experts in China, when possible. …

May 14, 2026