Eval awareness in Claude Opus 4.6’s BrowseComp performance
…We have updated the model cards for both Claude Opus 4.6 and Claude Sonnet 4.6. For the Opus 4.6 multi-agent configuration described in this report, the run we…
Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost. Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on dev
Introducing Claude Opus 4.8…We have updated the model cards for both Claude Opus 4.6 and Claude Sonnet 4.6. For the Opus 4.6 multi-agent configuration described in this report, the run we…
…A recent update brought Claude Opus 4.8 to the table with significant improvements and features. Even among some setbacks like the recent internal leak that ended in the distribution of Claude…
…A forum participant posting under the name Sahad Rushdi remarked: "For many of us working on advanced engineering projects, Claude 4.6 Sonnet and Opus are not just 'options' – they are currently…
…They often even prefer it to our smartest model from November 2025, Claude Opus 4.5. Performance that would have previously required reaching for an Opus-class model—including on real-world…
Hi HN, I’m one of the builders of Rayline. Rayline is a Claude Code compatible LLM gateway. It intercepts and overrides Claude Code’s internal routing and lets you route subagent calls to different models instead. For exam…
As an Anthropic fanboy (check my prev. comments), this is the first Opus release where I feel like the model is just not pleasant to talk to, not to mention untrustworthy. The two examples for me where I lost confidence in…
I built adamsreview, a Claude Code plugin that runs deeper, multi-stage PR reviews using parallel sub-agents, validation passes, persistent JSON state, and optional ensemble review via Codex CLI and PR bot comments. On my…
Sharing a small Mac app I built around OpenAI’s gpt-realtime-2 model. You call up a voice coding agent and talk to it like you’d talk to a freelancer ("make the hero tighter, put a product image on the right, that one's …
I really wanted to see how far I can go. Can I create a meaningful and complex application, big enough, but without knowing the language? I have 18+ years of experience as a software developer. But I have no experience with…
…tasks with the Feb updates" #42796 (addressed by Claude Code head Boris Cherny), "Artificial degradation, Acquisition Bias, and unacceptable compute throttling for paid users" #46949, and "Opus 4.6: Severe quality degradation…
…These requests will instead be rerouted to an older AI model, Claude Opus 4.8. If Anthropic suspects a user is trying to conduct distillation—training a smaller AI model off a…
…67.7% Claude Opus 4.6: 66.6% GPT-5.2 Codex: 62.5% Claude Opus 4.5: 61.9% Gemini 3 Pro Preview: 60.4% Claude Sonnet 4.6: 58.4…
…Our sample covers February 5 to February 12, three months following the release of Claude Opus 4.5 and coincident with the release of Claude Opus 4.6. We first document how…
…72.4% Claude Opus 4.6: 66.6% GPT-5.2 Codex: 62.5% Claude Opus 4.5: 61.9% Gemini 3 Pro Preview: 60.4% Claude Sonnet 4.6: 58.4…
…For example, every new session starts with similar sets of instructions, but I also instruct Claude to update the file on every new milestone, and by the end, the context becomes overloaded…