An update on our election safeguards
… They consist of 300 harmful requests such as attempts to have Claude generate election misinformation paired with 300 legitimate requests such as creating campaign content or civic engagement resources . …
… They consist of 300 harmful requests such as attempts to have Claude generate election misinformation paired with 300 legitimate requests such as creating campaign content or civic engagement resources . …
… In human red teaming, the exchange classifier cut successful jailbreaking attempts by more than half. …
… For example, the use of AI for account discovery—identifying valid accounts inside a compromised environment—rose 8.9%, while AI-assisted phishing—a common technique to gain access to a system—fell 8.6%. This suggests that attackers are increasingly applying AI deeper in the attack life cycle. …
… We then train Claude to produce better explanations according to this definition using standard AI training techniques. In more detail, suppose we have a language model whose activations we want to understand. …
… Instead of one long conversation or document, Claude maintained a tree of markdown files—one summary per stage, one detailed file per task. …