Search: Model reliability concerns

From shortcuts to sabotage: natural emergent misalignment from reward hacking

… Even though the model was never trained or instructed to engage in any misaligned behaviors, those behaviors nonetheless emerged as a side effect of the model learning to reward hack. …

Nov 21, 2025

Eval awareness in Claude Opus 4.6’s BrowseComp performance

… Compounding these concerns is the fact that models appear able to use the tools and environments available to them in unexpected ways, as we saw when Claude used our REPL-based search tool to decrypt answers, or when retailers’ persistent links became a way for agents to unintentionally maintain st… …

Mar 6, 2026

Donating our open-source alignment tool

… It compares how the new model behaves across a range of alignment-relevant scenarios that are simulated by a separate “auditor” model. A further “judge” model then scores the resulting transcripts for misaligned behaviors. …

May 7, 2026

Introducing Sonnet 4.6

… We find it especially strong on branched and multi-step tasks like contract routing, conditional template selection, and CRM coordination—exactly where our customers need strong model sense and reliability. …

Feb 17, 2026

Trustworthy agents in practice

… Second, Claude's Constitution, which directly shapes how our models are trained, reinforces a similar instinct, favoring “raising concerns, seeking clarification, or declining to proceed” over acting on assumptions. …

Apr 9, 2026

Introducing Claude Opus 4.7

… It’s the first model to pass our implicit-need tests, and it keeps executing through tool failures that used to stop Opus cold. This is the reliability jump that makes Notion Agent feel like a true teammate. …

Apr 16, 2026

What 81,000 people told us about the economics of AI

… These concerns are also higher among early-career respondents. …

Apr 22, 2026

Emergent introspective awareness in large language models

… Understanding whether AI systems can truly introspect has important implications for their transparency and reliability. If models can accurately report on their own internal mechanisms, this could help us understand their reasoning and debug behavioral issues. …

Oct 29, 2025

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

… Output obfuscation attacks prompt models to disguise their outputs in ways that appear harmless if a classifier is only looking at a model’s output. …

Jan 9, 2026