Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
…The top model (Claude Opus 4.6) passes only 66.7% of tasks, and no model reaches 70%. Local workspace repair is near-ceiling, but service-backed business workflows remain the real…
