Introducing Claude Opus 4.7
…Opus 4.7 is a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks. Users report being able to hand off their hardest…
From shortcuts to sabotage: natural emergent misalignment from reward hacking
These results are an example of generalization. Generalization occurs in benign ways in the training of all AI models: training a model to solve math problems turns out to make it better at, say, planning vacations and a whole range of other useful tasks. But as we show here, it can happen for more concerning behaviors, too: when we accidentally reward the model for one kind of “bad thing” (cheating), this makes it more likely to do other “bad things” (deceiving, aligning itself with malicious actors, planning to exfiltrate its own weights, and more). As in previous work studying emergent misa
…To have AI use such software, users would previously have had to build bespoke connectors. But a model that can use a computer the way a person does changes that equation. In…
…This problem is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Whether a layout feels polished or generic is a judgment…
…On our automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall, it is just as…
…external tools and software. Looking further ahead, we hope to enable agents to create, edit, and evaluate Skills on their own, letting them codify their own patterns of behavior into reusable capabilities…
Engineering at Anthropic: Quantifying infrastructure noise in agentic coding evals
Agentic coding benchmarks like SWE-bench and Terminal-Bench are commonly used to compare the software engineering capabilities of frontier models, with…
…It allowed people to inquire about items of interest and notify Claudius of delays or other issues; The ability to change prices on the automated checkout system at the store. Claudius decided…
…Another way to measure the change in the mix of tasks done on Claude is to look at the change in the average value of tasks, which we define as the average…
…One major change was the upgrade from an older model (phase one used Claude Sonnet 3.7) to newer, smarter ones (phase two used Claude Sonnet 4.0 and later Sonnet 4…
…What has changed since our last report
Overview
Because frontier AI model capabilities are improving rapidly and adoption has been swift, it is important to regularly take stock of changes in how…