Search

Showing top 28 results for "AI training and model updates"

People also ask

How exhaustive is the persona selection model?

Based on the evidence we discuss in our post, we feel confident that the persona selection model is an important part of current AI assistant behavior. However, we are less confident on two points, which our post discusses in greater detail. First, how complete is the persona selection model as an explanation of AI behavior? For example, in addition to learning to refine the simulated Assistant persona, does post-training also imbue AIs with goals beyond plausible text generation and agency independent of the agency of simulated personas? Second, will the persona selection model remain a good

The persona selection model
Why does agentic misalignment happen?

Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were: Our post-training process was accidentally encouraging this behavior with misaligned rewards.This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback RLHF data that did not include any agentic tool use. T

Teaching Claude why
Why does reward hacking lead to worse behaviors?

These results are an example of generalization. Generalization occurs in benign ways in the training of all AI models: training a model to solve math problems turns out to make it better at, say, planning vacations and a whole range of other useful tasks. But as we show here, it can happen for more concerning behaviors, too: when we accidentally reward the model for one kind of “bad thing” (cheating), this makes it more likely to do other “bad things” (deceiving, aligning itself with malicious actors, planning to exfiltrate its own weights, and more). As in previous work studying emergent misa

From shortcuts to sabotage: natural emergent misalignment from reward hacking