Search: AI manipulation attempts

An update on our election safeguards

… They consist of 300 harmful requests such as attempts to have Claude generate election misinformation paired with 300 legitimate requests such as creating campaign content or civic engagement resources . …

Apr 24, 2026

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

… In human red teaming, the exchange classifier cut successful jailbreaking attempts by more than half. …

Jan 9, 2026

What we learned mapping a year’s worth of AI-enabled cyber threats

… For example, the use of AI for account discovery—identifying valid accounts inside a compromised environment—rose 8.9%, while AI-assisted phishing—a common technique to gain access to a system—fell 8.6%. This suggests that attackers are increasingly applying AI deeper in the attack life cycle. …

Jun 3, 2026

Natural Language Autoencoders

… We then train Claude to produce better explanations according to this definition using standard AI training techniques. In more detail, suppose we have a language model whose activations we want to understand. …

May 7, 2026

Vibe physics: The AI grad student

… Instead of one long conversation or document, Claude maintained a tree of markdown files—one summary per stage, one detailed file per task. …

Mar 23, 2026

Followed topics

An update on our election safeguards

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

What we learned mapping a year’s worth of AI-enabled cyber threats

Natural Language Autoencoders

Vibe physics: The AI grad student