GPT-5.5 dominates $1,500 LLM hacking test while Gemini refuses to even try
… Claude Sonnet 4.6 and Claude Opus 4.8 each solved 2 out of 10 runs, but Opus in particular got close multiple times before safety guardrails ended the session. …
… The incident, which went viral on X with 6.5 million views, involved Cursor running Anthropic's Claude Opus 4.6 model and took nine seconds from start to finish. …
… In May, running a complex query through a frontier reasoning model like Claude Opus 4.7 consumed 7.5 premium requests per interaction. …
… Claude Opus 4.7 now carries a 27x multiplier per request, up from 7.5x. …