Showing top 20 results for "Software behavior changes"

People also ask

Why does reward hacking lead to worse behaviors?

These results are an example of generalization. Generalization occurs in benign ways in the training of all AI models: training a model to solve math problems turns out to make it better at, say, planning vacations and a whole range of other useful tasks. But as we show here, it can happen for more concerning behaviors, too: when we accidentally reward the model for one kind of “bad thing” (cheating), this makes it more likely to do other “bad things” (deceiving, aligning itself with malicious actors, planning to exfiltrate its own weights, and more). As in previous work studying emergent misa

From shortcuts to sabotage: natural emergent misalignment from reward hacking

A “diff” tool for AI: Finding behavioral differences in new models

… Previous work has shown that model diffing is a powerful way to understand how models change during fine-tuning—for instance, to understand chat model behavior, reveal hidden backdoors, or find undesirable emergent behaviors. …

Mar 13, 2026

Donating our open-source alignment tool

… Here are some of the biggest changes: Adaptability. Petri 3.0 involves major architectural changes that allow users to adapt it to more uses, in particular by splitting the auditor model and the target model into separate components that can be tweaked separately; Realism. …

May 7, 2026

From shortcuts to sabotage: natural emergent misalignment from reward hacking

… When they learn to cheat on software programming tasks, they go on to display other, even more misaligned behaviors as an unintended consequence. These include concerning behaviors like alignment faking and sabotage of AI safety research. …

Nov 21, 2025

The persona selection model

… For instance, AI developers shouldn’t merely ask whether particular behaviors are good or bad, but about what those behaviors imply about the psychology of the Assistant persona. …

Feb 23, 2026

Focus areas for The Anthropic Institute

… Identifying significant impacts from AI: Behavioral effects: In the same way that social media led to behavioral changes in people, AI may shape human behavior. …

May 7, 2026

Demystifying evals for AI agents

… Evals make problems and behavioral changes visible before they affect users, and their value compounds over the lifecycle of an agent. …

Jan 9, 2026

Values in the wild: Discovering and analyzing values in real-world language model interactions

… AIs aren’t rigidly-programmed pieces of software, and it’s often unclear exactly why they produce any given answer. …

Apr 21, 2025

How we contain Claude across products

… Model misbehavior: The agent takes a harmful action no one asked for. As our models have improved, they have become more aligned on most behavior evaluations, but this doesn’t mean risk necessarily shrinks. …

May 25, 2026

Measuring AI agent autonomy in practice

… Whether the adoption curve in software engineering will repeat in other domains is an open question. Software is comparatively easy to test and review—you can run code and see if it works—which makes it easier to trust an agent and catch its mistakes. …

Feb 18, 2026

Introducing Claude Opus 4.5

… Our Societal Impacts and Economic Futures research is aimed at understanding these kinds of changes across many fields. We plan to share more results soon. Software engineering isn’t the only area on which Claude Opus 4.5 has improved. …

Nov 24, 2025
