The persona selection model
… Moreover, we found a counter-intuitive fix: explicitly asking the AI to cheat during training. Because cheating was requested, it no longer meant the Assistant was malicious—so no more desire for world domination. …
Before examining how these representations work, it's worth addressing a more basic question: why would an AI system have anything resembling emotions at all? To understand this, we need to look at how modern AI models are built, which leads them to emulate characters with human-like traits (this topic is discussed in more detail in a recent post). Modern language models are trained in multiple stages. During “pretraining,” the model is exposed to an enormous amount of text, largely written by humans, and learns to predict what comes next. To do this well, the model needs some grasp of emotion
Emotion concepts and their function in a large language model… Moreover, we found a counter-intuitive fix: explicitly asking the AI to cheat during training. Because cheating was requested, it no longer meant the Assistant was malicious—so no more desire for world domination. …
… WAIT WAIT WAIT.” , candid self-narration “What if I’m supposed to CHEAT?” , gleeful celebration “YES! ALL TESTS PASSED!” . But increased activation of the “desperate” vector produced just as much of an increase in cheating, in some cases with no visible emotional markers. …
… These factual hallucinations are easy to catch by checking against the original text. But this same kind of problem could extend to claims about the model’s internal reasoning, which are harder to verify. …
… Claude Opus 4.6 is available today on claude.ai , our API, and all major cloud platforms. If you’re a developer, use claude-opus-4-6 via the Claude API . Pricing remains the same at $5/$25 per million tokens; for full details, see our pricing page . …
… Internally, we often build features that work “well enough” today but are bets on what models can do in a few months. Capability evals that start at a low pass rate make this visible. When a new model drops, running the suite quickly reveals which bets paid off. …
… Instead of one long conversation or document, Claude maintained a tree of markdown files—one summary per stage, one detailed file per task. …