
Showing top 51 results for "AI training from devs"

People also ask

Why does agentic misalignment happen?

Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were: (1) our post-training process was accidentally encouraging this behavior with misaligned rewards; (2) this behavior was coming from the pre-trained model, and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. …

Teaching Claude why

What is a natural language autoencoder?

The core idea is to train Claude to explain its own activations. But how do we know whether an explanation is good? Since we don't know what thoughts an activation actually encodes, we can't directly check whether an explanation is accurate. So we train a second copy of Claude to work backwards: to reconstruct the original activation from the text explanation. We consider an explanation to be good if it leads to an accurate reconstruction. We then train Claude to produce better explanations according to this definition using standard AI training techniques. In more detail, suppose we have a language model whose activations we want to understand. …

Natural Language Autoencoders
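The scoring idea above (an explanation is good if a second model can rebuild the activation from it) can be illustrated with a toy Python sketch. This is not Anthropic's implementation, and every name here is hypothetical: the "activation" is a 3-dimensional vector, the reconstructor is a fixed bag-of-words decoder (`WORD_VECTORS`, `reconstruct`), and an exhaustive search over short word combinations (`best_explanation`) stands in for the standard training techniques the post mentions.

```python
from itertools import combinations

# Toy sketch only: the vocabulary, vectors, and function names are
# hypothetical and are NOT Anthropic's natural language autoencoder.
# Each word maps to a fixed direction in a 3-dimensional "activation" space.
WORD_VECTORS = {
    "rhyme": (1.0, 0.0, 0.0),
    "poetry": (0.0, 1.0, 0.0),
    "french": (0.0, 0.0, 1.0),
    "planning": (0.5, 0.5, 0.0),
}


def reconstruct(explanation: str) -> tuple[float, ...]:
    """Toy reconstructor: decode a text explanation by summing word vectors."""
    dim = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dim
    for word in explanation.lower().split():
        vec = WORD_VECTORS.get(word)
        if vec is None:
            continue  # words the toy decoder doesn't know contribute nothing
        for i, x in enumerate(vec):
            total[i] += x
    return tuple(total)


def reconstruction_error(activation: tuple[float, ...], explanation: str) -> float:
    """Squared error between the original activation and its reconstruction.

    Lower is better: under this definition, a good explanation is one
    that lets the reconstructor recover the activation accurately.
    """
    rec = reconstruct(explanation)
    return sum((a - r) ** 2 for a, r in zip(activation, rec))


def best_explanation(activation: tuple[float, ...], max_words: int = 2) -> str:
    """Stand-in for training the explainer: try every short word combination
    and keep the one whose reconstruction is closest to the activation."""
    candidates = [
        " ".join(words)
        for n in range(1, max_words + 1)
        for words in combinations(WORD_VECTORS, n)
    ]
    return min(candidates, key=lambda e: reconstruction_error(activation, e))
```

For example, `best_explanation((1.0, 1.0, 0.0))` selects `"rhyme poetry"`, because its decoded vector matches the activation exactly. In the setup the post describes, both directions would be learned models, and the explainer would be trained against this kind of score rather than searched exhaustively.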

How exhaustive is the persona selection model?

Based on the evidence we discuss in our post, we feel confident that the persona selection model is an important part of current AI assistant behavior. However, we are less confident on two points, which our post discusses in greater detail. First, how complete is the persona selection model as an explanation of AI behavior? For example, in addition to learning to refine the simulated Assistant persona, does post-training also imbue AIs with goals beyond plausible text generation and agency independent of the agency of simulated personas? Second, will the persona selection model remain a good …

The persona selection model

The persona selection model

… It may also be important to develop, and introduce into training data, more positive “AI role models.” Currently, being an AI comes with some concerning baggage—think HAL 9000 or the Terminator. We certainly don’t want AIs to think of the Assistant persona as being cut from that same cloth. …

Feb 23, 2026

Teaching Claude why

… Our main two hypotheses were: Our post-training process was accidentally encouraging this behavior with misaligned rewards. This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. …

May 8, 2026

Anthropic and Amazon expand collaboration for up to 5 gigawatts of new compute

… We continue to choose AWS as our primary training and cloud provider for mission-critical workloads. “Our custom AI silicon offers high performance at significantly lower cost for customers, which is why it’s in such hot demand,” said Andy Jassy, CEO of Amazon. “Anthropic's commitment to run its la… …

Apr 20, 2026

Natural Language Autoencoders

… We then train Claude to produce better explanations according to this definition using standard AI training techniques. In more detail, suppose we have a language model whose activations we want to understand. …

May 7, 2026

Anthropic invests $100 million into the Claude Partner Network

… We’re committing an initial $100 million to support our partners with training courses, dedicated technical support, and joint market development. Partners who join from today will get immediate access to a new technical certification and be eligible for investment. …

Mar 12, 2026

From shortcuts to sabotage: natural emergent misalignment from reward hacking

… In the latest research from Anthropic’s alignment team, we show for the first time that realistic AI training processes can accidentally produce misaligned models¹. …

Nov 21, 2025

Anthropic expands partnership with Google and Broadcom for multiple gigawatts of next-generation compute

… Amazon remains our primary cloud provider and training partner, and we continue to work closely with AWS on Project Rainier. …

Apr 6, 2026

How people ask Claude for personal guidance

… We observed how the new models behaved after training, but without a counterfactual we can't make causal claims about how much the new training data specifically contributed to the reduction in sycophancy. …

Apr 30, 2026

Widening the conversation on frontier AI

… From all that text, they pick up on ways of speaking, reasoning, and making choices. Developers then shape that further through training: choosing which patterns to reinforce, which to set aside, and what kind of character we want them to develop. …

May 19, 2026

Trustworthy agents in practice

… We tackle this from multiple angles during Claude’s training. First, we construct training scenarios that place Claude in ambiguous situations, and then reinforce Claude’s choice to pause, rather than to assume. …

Apr 9, 2026
