Quantifying infrastructure noise in agentic coding evals
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
Before we started this research, it was not clear where the misaligned behavior was coming from. Our two main hypotheses were: (1) Our post-training process was accidentally encouraging this behavior with misaligned rewards. (2) This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. T
Teaching Claude why
If you’re willing to entertain the views outlined above, then it’s not very hard to argue that AI could be a risk to our safety and security. There are two common sense reasons to be concerned. First, it may be tricky to build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers. To use an analogy, it is easy for a chess grandmaster to detect bad moves in a novice but very hard for a novice to detect bad moves in a grandmaster. If we build an AI system that’s significantly more competent than human
Core views on AI safety: When, why, what, and how
Discover how Anthropic approaches the development of reliable AI agents. Learn about our research on agent capabilities, safety considerations, and technical framework for building trustworthy AI.
…Research demonstration: In collaboration with Neuronpedia, our researchers are also providing a research demo, where you can view activations along the Assistant Axis while chatting with a standard model and an activation…