Search

Showing top 49 results for "AI training access"

People also ask

What is a natural language autoencoder?

The core idea is to train Claude to explain its own activations. But how do we know whether an explanation is good? Since we don't know what thoughts an activation actually encodes, we can't directly check whether an explanation is accurate. So we train a second copy of Claude to work backwards—reconstruct the original activation from the text explanation. We consider an explanation to be good if it leads to an accurate reconstruction. We then train Claude to produce better explanations according to this definition using standard AI training techniques. In more detail, suppose we have a langua

Natural Language Autoencoders
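
The snippet above describes scoring an explanation by how well a second model can reconstruct the original activation from it. A minimal sketch of that scoring loop follows, under loud assumptions: `explain` and `reconstruct` are hypothetical stand-ins for the two model copies, activations are plain lists of floats, and mean squared error is an assumed accuracy measure (the snippet does not name one).

```python
from typing import Callable, List

Vector = List[float]


def reconstruction_error(original: Vector, reconstructed: Vector) -> float:
    """Mean squared error between two equal-length, non-empty vectors (assumed metric)."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def score_explanation(
    activation: Vector,
    explain: Callable[[Vector], str],
    reconstruct: Callable[[str], Vector],
) -> float:
    """Score an explanation by how well the activation is rebuilt from it.

    Higher is better; 0 means a perfect reconstruction.
    """
    explanation = explain(activation)   # activation -> text
    rebuilt = reconstruct(explanation)  # text -> activation
    return -reconstruction_error(activation, rebuilt)


# Toy stand-ins (hypothetical): the "explanation" is just the numbers as text.
def precise_explain(v: Vector) -> str:
    return " ".join(f"{x:.3f}" for x in v)


def vague_explain(v: Vector) -> str:
    return " ".join(f"{x:.0f}" for x in v)


def toy_reconstruct(text: str) -> Vector:
    return [float(token) for token in text.split()]


activation = [0.25, -1.5, 3.125]
precise_score = score_explanation(activation, precise_explain, toy_reconstruct)
vague_score = score_explanation(activation, vague_explain, toy_reconstruct)
```

In this toy setup the more detailed explanation earns the higher score because it loses less information. In the training setup the snippet describes, a signal of this kind would be what the explainer is optimized against.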

Why does agentic misalignment happen?

Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were: (1) our post-training process was accidentally encouraging this behavior with misaligned rewards; (2) this behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. T

Teaching Claude why


Higher usage limits for Claude and a compute deal with SpaceX

Introducing the Services Track and Partner Hub of the Claude Partner Network

LLMs and biorisk

Introducing our Science Blog

Claude Fable 5 and Claude Mythos 5

Australian government and Anthropic sign MOU for AI safety and research

Emergent introspective awareness in large language models

Emotion concepts and their function in a large language model

Focus areas for The Anthropic Institute

KPMG integrates Claude across its core business and workforce of more than 276,000 in strategic alliance