Followed topics

Search

Showing top 43 results for "AI agent safety"

All sources anthropic.com 43

People also ask

Why does agentic misalignment happen?

Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were: Our post-training process was accidentally encouraging this behavior with misaligned rewards.This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback RLHF data that did not include any agentic tool use. T

Teaching Claude why

Building Effective AI Agents

Discover how Anthropic approaches the development of reliable AI agents. Learn about our research on agent capabilities, safety considerations, and technical framework for building trustworthy AI.

Harness design for long-running application development

…That being said, the QA agent still caught real gaps. In its first-round feedback, it noted: This is a strong app with excellent design fidelity, solid AI agent, and good backend…

Partnering with Mozilla to improve Firefox’s security

…a trusted method of confirming whether an AI agent’s output actually achieves its goal. Task verifiers give the agent real-time feedback as it explores a codebase, allowing it to iterate…

Announcing the Anthropic Economic Index Survey

…Two scenarios for global AI leadership Our views on the AI competition between the US and China. Teaching Claude why New research on how we've reduced agentic misalignment. Natural Language Autoencoders…

Scaling Managed Agents: Decoupling the brain from the hands

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Quantifying infrastructure noise in agentic coding evals

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Anthropic and Amazon expand collaboration for up to 5 gigawatts of new compute

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

How Australia Uses Claude: Findings from the Anthropic Economic Index

…Two scenarios for global AI leadership Our views on the AI competition between the US and China. Teaching Claude why New research on how we've reduced agentic misalignment. Natural Language Autoencoders…

How people ask Claude for personal guidance

…Two scenarios for global AI leadership Our views on the AI competition between the US and China. Teaching Claude why New research on how we've reduced agentic misalignment. Natural Language Autoencoders…

Anthropic forms $200 million partnership with the Gates Foundation

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.