Showing top 28 results for "AI training and model updates"

People also ask

How exhaustive is the persona selection model?

Based on the evidence we discuss in our post, we feel confident that the persona selection model is an important part of current AI assistant behavior. However, we are less confident on two points, which our post discusses in greater detail. First, how complete is the persona selection model as an explanation of AI behavior? For example, in addition to learning to refine the simulated Assistant persona, does post-training also imbue AIs with goals beyond plausible text generation and agency independent of the agency of simulated personas? Second, will the persona selection model remain a good

The persona selection model

Why does agentic misalignment happen?

Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were: (1) Our post-training process was accidentally encouraging this behavior with misaligned rewards. (2) This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. T

Teaching Claude why

Why does reward hacking lead to worse behaviors?

These results are an example of generalization. Generalization occurs in benign ways in the training of all AI models: training a model to solve math problems turns out to make it better at, say, planning vacations and a whole range of other useful tasks. But as we show here, it can happen for more concerning behaviors, too: when we accidentally reward the model for one kind of “bad thing” (cheating), this makes it more likely to do other “bad things” (deceiving, aligning itself with malicious actors, planning to exfiltrate its own weights, and more). As in previous work studying emergent misa

From shortcuts to sabotage: natural emergent misalignment from reward hacking

Teaching Claude why

… Our main two hypotheses were: Our post-training process was accidentally encouraging this behavior with misaligned rewards. This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. …

May 8, 2026

The persona selection model

… Second, will the persona selection model remain a good model of AI assistant behavior in the future? Since it is pretraining that initially teaches the model to simulate personas, we might worry that AIs with longer and more intensive post-training will be less persona-like. …

Feb 23, 2026

From shortcuts to sabotage: natural emergent misalignment from reward hacking

… Generalization occurs in benign ways in the training of all AI models: training a model to solve math problems turns out to make it better at, say, planning vacations and a whole range of other useful tasks. …

Nov 21, 2025

An update on our election safeguards

… This is built into the model through character training where we reward the model for producing responses that reflect a set of values and traits, and then reinforced through our system prompts, which carry explicit instructions on political neutrality into every conversation on Claude.ai. …

Apr 24, 2026

Anthropic and Amazon expand collaboration for up to 5 gigawatts of new compute

… We continue to choose AWS as our primary training and cloud provider for mission-critical workloads. “Our custom AI silicon offers high performance at significantly lower cost for customers, which is why it’s in such hot demand,” said Andy Jassy, CEO of Amazon. “Anthropic's commitment to run its la… …

Apr 20, 2026

Values in the wild: Discovering and analyzing values in real-world language model interactions

… But as with any aspect of AI training, we can’t be certain that the model will stick to our preferred values. AIs aren’t rigidly-programmed pieces of software, and it’s often unclear exactly why they produce any given answer. …

Apr 21, 2025

The assistant axis: situating and stabilizing the character of large language models

… Another is that it already exists in pre-trained models, reflecting some structure in the training data itself. To find out, we looked at the base versions of some of these models (i.e., the version of the models that exist prior to post-training). …

Jan 19, 2026

Claude for Financial Services

… These partners provide tailored solutions across compliance, research, and enterprise AI adoption: Accenture helps financial services firms deploy and scale Claude across front, middle, and back office functions—from trading and research to compliance and customer experience; Deloitte enhances resea… …

Jul 15, 2025

Donating our open-source alignment tool

… We’ve been pleased to see Petri being used by external organizations: for example, the UK’s AI Security Institute (AISI) made it a major part of how they evaluate models for their propensity to sabotage AI research. …

May 7, 2026

A “diff” tool for AI: Finding behavioral differences in new models

… Such behaviors could be the result of deliberate training decisions on the part of the model developers, or they could emerge indirectly and unintentionally from the data the model was trained on. We focused on open-source language models in this research as this was an Anthropic Fellows project. …

Mar 13, 2026
