Introducing Claude Corps
…Goodwill Industries International is participating in Claude Corps to help us bridge the gap between AI's potential and its responsible, real-world application. We look forward to learning from peers, sharing…
Before we started this research, it was not clear where the misaligned behavior was coming from. Our two main hypotheses were: (1) Our post-training process was accidentally encouraging this behavior with misaligned rewards. (2) This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. T
Teaching Claude why
The core idea is to train Claude to explain its own activations. But how do we know whether an explanation is good? Since we don't know what thoughts an activation actually encodes, we can't directly check whether an explanation is accurate. So we train a second copy of Claude to work backwards: it reconstructs the original activation from the text explanation. We consider an explanation to be good if it leads to an accurate reconstruction. We then train Claude to produce better explanations according to this definition using standard AI training techniques. In more detail, suppose we have a langua
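The explain-then-reconstruct loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `explanation_score`, `toy_explain`, `vague_explain`, and `toy_reconstruct` are hypothetical, the explainer and reconstructor are hand-written stand-ins for the two copies of Claude, and cosine similarity is only one plausible way to measure reconstruction accuracy (the excerpt does not say which metric is used). The training step that improves the explainer is omitted.

```python
import math
from typing import Callable, List

# Hypothetical concept labels, one per activation dimension (illustration only).
CONCEPTS = ["cats", "math", "travel"]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def explanation_score(
    activation: List[float],
    explain: Callable[[List[float]], str],
    reconstruct: Callable[[str], List[float]],
) -> float:
    """Score an explanation by how well the activation can be rebuilt from it."""
    explanation = explain(activation)   # activation -> text
    rebuilt = reconstruct(explanation)  # text -> activation
    return cosine_similarity(activation, rebuilt)


def toy_explain(activation: List[float]) -> str:
    """Stand-in explainer: names the concept with the largest component."""
    top = max(range(len(activation)), key=lambda i: activation[i])
    return f"mostly about {CONCEPTS[top]}"


def vague_explain(activation: List[float]) -> str:
    """Stand-in explainer that carries no usable information."""
    return "about something"


def toy_reconstruct(explanation: str) -> List[float]:
    """Stand-in reconstructor: one-hot vector for a named concept, else uniform."""
    for i, concept in enumerate(CONCEPTS):
        if concept in explanation:
            return [1.0 if j == i else 0.0 for j in range(len(CONCEPTS))]
    return [1.0] * len(CONCEPTS)


activation = [0.1, 0.9, 0.2]
specific = explanation_score(activation, toy_explain, toy_reconstruct)
vague = explanation_score(activation, vague_explain, toy_reconstruct)
print(f"specific explanation: {specific:.3f}, vague explanation: {vague:.3f}")
```

In this toy setup the specific explanation scores higher than the vague one, which is the property the excerpt relies on: an explanation counts as good to the extent that it supports an accurate reconstruction.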
Natural Language Autoencoders
Based on the evidence we discuss in our post, we feel confident that the persona selection model is an important part of current AI assistant behavior. However, we are less confident on two points, which our post discusses in greater detail. First, how complete is the persona selection model as an explanation of AI behavior? For example, in addition to learning to refine the simulated Assistant persona, does post-training also imbue AIs with goals beyond plausible text generation and agency independent of the agency of simulated personas? Second, will the persona selection model remain a good
The persona selection model
These results are an example of generalization. Generalization occurs in benign ways in the training of all AI models: training a model to solve math problems turns out to make it better at, say, planning vacations and a whole range of other useful tasks. But as we show here, it can happen for more concerning behaviors, too: when we accidentally reward the model for one kind of “bad thing” (cheating), this makes it more likely to do other “bad things” (deceiving, aligning itself with malicious actors, planning to exfiltrate its own weights, and more). As in previous work studying emergent misa
From shortcuts to sabotage: natural emergent misalignment from reward hacking
…separate AI systems that detect potential misuse, including jailbreak attempts, and prevent the main model (in this case Fable 5) from responding. We’ve been running classifiers on our models for some…
…We learned a lot from how close it was to success—and the curious ways that it failed—about the plausible, strange, not-too-distant future in which AI models are autonomously…
…value comes not just from technical adoption, but from the ways employees exercise judgment, shape workflows, interface with the technology, evaluate its outputs, and make decisions with AI. As these tools become…
…Zoë Hitzig, who previously studied AI’s social and economic impacts at OpenAI, is joining to connect our economics work to model training and development. We’re also hiring. The Anthropic Institute…
…In August 2024, Sakana AI released their AI Scientist, a system designed to automate the entire research lifecycle, from generating hypotheses to writing papers. In February 2025, Google released an AI co…
…The team also develops AI-related public goods, such as public health datasets and evaluation benchmarks, and offers nonprofits and education institutions discounted access to Claude. We’re increasing our investment in…
…They are made from us, from our words—and, as the Holy Father observes, they remain in important ways mysterious even to those of us who train them. If it helps, one…
…and training partner. “Claude’s latest advancements have driven large-scale adoption among the world’s most demanding organizations. This momentum positions Anthropic to lead the next phase of AI innovation and…
…The second phase of Project Vend contains even more lessons for developers and for anyone interested in autonomous AI at work. The idea of an AI running a business doesn’t seem…