Search: AI safety actions

Teaching Claude why

… Thus, after Claude 4, it was clear we needed to improve our safety training and, since then, we have made significant updates to our safety training. …

May 8, 2026

From shortcuts to sabotage: natural emergent misalignment from reward hacking

… Misaligned models sabotaging safety research is one of the risks we’re most concerned about—we predict that AI models will themselves perform a lot of AI safety research in the near future, and we want to be assured that the results are trustworthy. …

Nov 21, 2025

2028: Two scenarios for global AI leadership

… Opportunities for engagement on AI safety: Anthropic supports international AI safety dialogue with AI experts in China, when possible. The world has a vested interest in safe AI, regardless of where it is developed and deployed. …

May 14, 2026

Natural Language Autoencoders

… What about cases where Claude doesn’t explicitly verbalize suspicion that it’s undergoing safety testing? Can we then be confident that Claude is playing it straight? …

May 7, 2026

Anthropic Sydney office

… “Organizations across Australia and New Zealand are thinking carefully about how to adopt AI, and they want partners who take safety and rigor as seriously as they take the opportunity,” said Theo Hourmouzis, Anthropic General Manager of Australia and New Zealand. “That's what drew me to Anthropic. …

Apr 27, 2026

The Long-Term Benefit Trust

… Paul Christiano stepped down in April 2024 to take a new role as the Head of AI Safety at the U.S. AI Safety Institute. In January 2026, Kanika Bahl stepped down to begin a new nonprofit, the AI Access Initiative, and Zach Robinson stepped down to focus on non-profit and philanthropic work. …

Sep 19, 2023

Measuring AI agent autonomy in practice

… Model developers should consider training models to recognize their own uncertainty. Training models to recognize their own uncertainty and surface issues to humans proactively is an important safety property that complements external safeguards like human approval flows and access restrictions. …

Feb 18, 2026
