Paper page - Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
… These limitations hinder reliable assessment of both image editing models and reward models. …
… This survey provides a unified review organized around four causally linked stages, which we term the LIFE progression: Lay the capability foundation, Integrate agents through collaboration, Find faults through attribution, and Evolve through autonomous self-improvement. …
…This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT…
…A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models. Published on May 7. Submitted by Xin Gao on May 8. University of California San Diego. …
…Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories. Published on May 5. Submitted by taesiri on May 6. #2 Paper of the day. Authors: Yuwen Du, …
… Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. …
… Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. …
… We study this capability through the lens of creative tool use, where a model repurposes available objects by reasoning about their affordances and attributes rather than relying on canonical usage. …
… The benchmark highlights the limitations of volumetric accuracy as a proxy for localized surgical utility, motivating uncertainty-aware probabilistic models for preoperative decision-making. …
… Evaluating 14 representative world models, we identify key limitations and provide insights for future research. …