Showing top 6 results for "internal AI cheating"

People also ask

Why would an AI model represent emotions?

Before examining how these representations work, it's worth addressing a more basic question: why would an AI system have anything resembling emotions at all? To understand this, we need to look at how modern AI models are built, which leads them to emulate characters with human-like traits (this topic is discussed in more detail in a recent post). Modern language models are trained in multiple stages. During “pretraining,” the model is exposed to an enormous amount of text, largely written by humans, and learns to predict what comes next. To do this well, the model needs some grasp of emotion

Emotion concepts and their function in a large language model

The persona selection model

… Moreover, we found a counter-intuitive fix: explicitly asking the AI to cheat during training. Because cheating was requested, it no longer meant the Assistant was malicious—so no more desire for world domination. …

Feb 23, 2026

Emotion concepts and their function in a large language model

… WAIT WAIT WAIT.”, candid self-narration “What if I’m supposed to CHEAT?”, gleeful celebration “YES! ALL TESTS PASSED!”. But increased activation of the “desperate” vector produced just as much of an increase in cheating, in some cases with no visible emotional markers. …

Apr 2, 2026

Natural Language Autoencoders

… These factual hallucinations are easy to catch by checking against the original text. But this same kind of problem could extend to claims about the model’s internal reasoning, which are harder to verify. …

May 7, 2026

Claude Opus 4.6

… Claude Opus 4.6 is available today on claude.ai, our API, and all major cloud platforms. If you’re a developer, use claude-opus-4-6 via the Claude API. Pricing remains the same at $5/$25 per million tokens; for full details, see our pricing page. …

Feb 5, 2026

Demystifying evals for AI agents

… Internally, we often build features that work “well enough” today but are bets on what models can do in a few months. Capability evals that start at a low pass rate make this visible. When a new model drops, running the suite quickly reveals which bets paid off. …

Jan 9, 2026

Vibe physics: The AI grad student

… Instead of one long conversation or document, Claude maintained a tree of markdown files—one summary per stage, one detailed file per task. …

Mar 23, 2026
