The Long-Term Benefit Trust
…Paul Christiano stepped down in April 2024 to take a new role as the Head of AI Safety at the U.S. AI Safety Institute . In January 2026, Kanika Bahl stepped down…
Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were: Our post-training process was accidentally encouraging this behavior with misaligned rewards.This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback RLHF data that did not include any agentic tool use. T
Teaching Claude why…Paul Christiano stepped down in April 2024 to take a new role as the Head of AI Safety at the U.S. AI Safety Institute . In January 2026, Kanika Bahl stepped down…
…The premise Most scientists currently using AI agents work in a conversational loop, managing each step of the process on a tight leash. As models have become significantly better at long-horizon…
…Public Policy focuses on the areas where Anthropic has defined priorities and perspectives, including model safety and transparency , energy ratepayer protections , infrastructure investments , export controls , and democratic leadership in AI . Sarah Heck…
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
…6 Finally, in a world where larger fractions of economic activity are autonomously managed by AI agents, odd scenarios like this could have cascading effects—especially if multiple agents based on similar…
…This latest funding is expected to advance our safety and interpretability research, expand compute to meet growing demand for Claude, and scale the products and partnerships our customers rely on. “Claude is…
…For governance, observability, and the rest of the stack, see NIST's project on AI agent identity and authorization , the six-agency guidance on adopting agentic AI led by Australia's ACSC…
…Opportunities for engagement on AI safety Anthropic supports international AI safety dialogue with AI experts in China, when possible. The world has a vested interest in safe AI, regardless of where it…
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.