GPT-5.5 dominates $1,500 LLM hacking test while Gemini refuses to even try
… Claude Sonnet 4.6 and Claude Opus 4.8 each succeeded in 2 out of 10 runs, but Opus in particular got close multiple times before safety guardrails ended the session. …
… In May, running a complex query through a frontier reasoning model like Claude Opus 4.7 consumed 7.5 premium requests per interaction. …
… Claude Opus 4.7 now carries a 27x multiplier per request, up from 7.5x. …