WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
…These results show that long-horizon, native-runtime agent evaluation remains far from solved for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation…