GPT-5.5 dominates $1,500 LLM hacking test while Gemini refuses to even try
…Claude Sonnet 4.6 and Claude Opus 4.8 each solved 2 out of 10 runs, but Opus in particular got close multiple times before safety guardrails ended the session. At the…
Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost. Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on dev
Introducing Claude Opus 4.8…Claude Sonnet 4.6 and Claude Opus 4.8 each solved 2 out of 10 runs, but Opus in particular got close multiple times before safety guardrails ended the session. At the…
…models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness…
…I also went in and selected the latest Claude model, which is Opus 4.7. I clicked on the icon on the far left of the chat box. You can drop files…
…These updates pair best with Claude Opus 4.7, which is state-of-the-art on financial tasks and leads the industry on Vals AI's Finance Agent benchmark, at 64.37…
Hi HN, I’m one of the builders of Rayline. Rayline is a Claude Code-compatible LLM gateway. It intercepts and overrides Claude Code’s internal routing and lets you route subagent calls to different models instead. For exam…
As an Anthropic fanboy (check my prev. comments), this is the first Opus release where I feel like the model is just not pleasant to talk to, not to mention untrustworthy. The two examples for me where I lost confidence in…
I built adamsreview, a Claude Code plugin that runs deeper, multi-stage PR reviews using parallel sub-agents, validation passes, persistent JSON state, and optional ensemble review via Codex CLI and PR bot comments. On my…
Sharing a small Mac app I built around OpenAI’s gpt-realtime-2 model. You call up a voice coding agent and talk to it like you’d talk to a freelancer ("make the hero tighter, put a product image on the right, that one's …
I really wanted to see how far I can go. Can I create a meaningful and complex application, big enough, but without knowing the language? I have 18+ years of experience as a software developer. But I have no experience with…
…model or Opus 4.8, and existing Fable 5 sessions will end with an error. On the Claude Platform, requests to Fable 5 will also return an error. Please update your integrations…
AMD's AI director slams Claude Code for becoming dumber and lazier since last update: 'Claude cannot be trusted to perform complex engineering tasks,' according to GitHub ticket. If you…
…Thus, after Claude 4, it was clear we needed to improve our safety training and, since then, we have made significant updates to our safety training. We use agentic misalignment as a…
…For developers, platform engineers, and engineering leaders, this is not an incremental model update. Claude Fable 5 completes multi-step, goal-directed work that previous models could not sustain, and it does…
…started: * The AI-designed car is taking shape * OpenAI’s big Codex update is a direct shot at Claude Code * Microsoft and OpenAI’s famed AGI agreement is dead * OpenAI’s new…
…The results suggest that the new stories were able to effectively “update the prior around Claude’s baseline expectations for AI behavior outside of the Claude persona.” The researchers theorize that this…