GPT-5.5 dominates $1,500 LLM hacking test while Gemini refuses to even try
… For anyone running security tooling at scale, that gap matters. Claude Sonnet 4.6 and Claude Opus 4.8 each succeeded in 2 of 10 runs, though Opus in particular came close several times before safety guardrails ended the session. …