Search: Claude Opus 4.8 release

Paper page - WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

… Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. …

May 15, 2026

Paper page - Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

… First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. …

Jun 1, 2026

Paper page - ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

… We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra high reasoning effort. …

May 14, 2026

Paper page - Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

… On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. …

May 12, 2026

Paper page - Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

… We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. …

Jun 9, 2026

Paper page - Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

… Key findings: The top model Claude Opus 4.6 passes only 66.7% of tasks, and no model reaches 70%. …

May 1, 2026

Paper page - QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

… We evaluated three frontier VLMs (GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7) across 270 games in both homogeneous and cross-model adversarial settings. …

May 27, 2026
