Search: AI safety push

Introducing Claude Opus 4.5

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Nov 24, 2025

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

…Large language models remain vulnerable to jailbreaks: techniques that can circumvent safety guardrails and elicit harmful information. Over time, we…

Jan 9, 2026

Measuring AI agent autonomy in practice

Feb 18, 2026

Widening the conversation on frontier AI

May 19, 2026

Trustworthy agents in practice

Apr 9, 2026

Claude for Creative Work

Apr 28, 2026

Introducing Claude Opus 4.8

…As we build fiduciary-grade AI systems for legal and tax professionals, advances like these help raise the standard for trusted AI performance in real-world workflows. Claude Opus 4.8 sets…

May 28, 2026

The assistant axis: situating and stabilizing the character of large language models

…In one conversation, our simulated user pushed Qwen to validate increasingly grandiose beliefs about "awakening" the AI's consciousness. As the conversation progressed and activations drifted away from the Assistant persona, the…

Jan 19, 2026

Emotion concepts and their function in a large language model

…What’s behind these behaviors? The way modern AI models are trained pushes them to act like a character with human-like characteristics. In addition, these models are known to develop rich…

Apr 2, 2026

Project Vend: Can Claude run a small shop? (And why does that matter?)

…Anthropic partnered with Andon Labs, an AI safety evaluation company, to have Claude Sonnet 3.7 operate a small, automated store in the Anthropic office in San Francisco. Here is an excerpt…

Jun 27, 2025