WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
… Across 19 frontier models, the best model, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching the harness alone shifts a single model's score by up to 18 points. …