Redeploying Claude Fable 5
…there will be many minor jailbreaks, some narrow harmful ones, and although no universal jailbreaks for Fable 5 have been discovered at the time of writing, expert safety researchers continue to red…
Before we started this research, it was not clear where the misaligned behavior was coming from. Our two main hypotheses were: (1) our post-training process was accidentally encouraging this behavior with misaligned rewards; (2) this behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. T
Teaching Claude why
If you’re willing to entertain the views outlined above, then it’s not very hard to argue that AI could be a risk to our safety and security. There are two common-sense reasons to be concerned. First, it may be tricky to build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers. To use an analogy, it is easy for a chess grandmaster to detect bad moves in a novice but very hard for a novice to detect bad moves in a grandmaster. If we build an AI system that’s significantly more competent than human
Core views on AI safety: When, why, what, and how
…It’s intelligent, efficient, and the best model in the world for coding, agents, and computer use. It’s also meaningfully better at everyday tasks like deep research and working with slides…
…Anthropic will provide Claude access to up to 60 NAIRL-affiliated researchers, supporting work on AI safety, model evaluation, alignment, robustness, and broader frontier AI research. In the nonprofit sector, Good Neighbors…
…Claude 4 models outperform other frontier models as research agents across financial tasks in Vals AI's Finance Agent benchmark. When deployed by FundamentalLabs to build an Excel agent, Claude Opus 4…
…Working through a task, an agent will often encounter things its plan didn’t cover. It might be able to resolve many of these gaps itself (e.g., research the information it…
…Our safety researchers concluded that Sonnet 4.6 has “a broadly warm, honest, prosocial, and at times funny character, very strong safety behaviors, and no signs of major concerns around high-stakes…
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
…Oversight requirements that prescribe specific interaction patterns, such as requiring humans to approve every action, will create friction without necessarily producing safety benefits. As agents and the science of agent measurement mature…
…This latest funding is expected to advance our safety and interpretability research, expand compute to meet growing demand for Claude, and scale the products and partnerships our customers rely on. “Claude is…