Search: agentic tooling

Paper page - PREPING: Building Agent Memory without Tasks

…LLM agents often need memory to solve tasks in new tool environments, but memory is usually built only after…

May 15, 2026

Paper page - Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

…Process-Reward Optimization for Computer Use Agents (2026) UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization (2026) OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis (2026…

Jun 1, 2026

Paper page - SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

…A Hierarchical Benchmark for Visual Website Development with Agent Verification (2026) WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing (2026) Test-Driven AI Agent Definition (TDAD): Compiling Tool…

May 7, 2026

Paper page - MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

…Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents (2026) ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation (2026) LLM as a Tool…

Jun 5, 2026

Paper page - ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

…Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task…

May 15, 2026

Paper page - Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

…Measuring Reward Hacking in Long-Horizon Coding Agents (2026) Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use (2026) Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on…

Jun 10, 2026

Paper page - ClawGym: A Scalable Framework for Building Effective Claw Agents

…developing Claw-style personal agents with synthetic training data, verified workspaces, and benchmark evaluation. Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states…

Apr 30, 2026

Paper page - FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search

…Which is, I think, why the interpretable traces are the most durable contribution here — not as the agent's own verdict, but as the surface an external check (a human, a tool…

Jun 2, 2026

Paper page - PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

…trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents…

Jun 9, 2026

Paper page - Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

…LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing…