Search: model rollout

Paper page - LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

…Furthermore, it estimates an adaptive per-problem target length online based on the model's own correct rollouts, applying a symmetric efficiency reward that penalizes both overthinking and over-compression. Evaluated on…

May 14, 2026

Paper page - OPRD: On-Policy Representation Distillation

…We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely…

Jun 5, 2026

Paper page - Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

…sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In…

Paper page - Pushing Biomolecular Utility-Diversity Frontiers with Supergroup Relative Policy Optimization

…For each condition, SGRPO samples a supergroup of candidate sets, compares their diversity under the same condition, and redistributes the group diversity reward to individual rollouts through leave-one-out diversity…

May 12, 2026

Paper page - The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

…Learning Budget-Efficient Thinking for Adaptive Reasoning (2026); Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR (2026); Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for…

Jun 5, 2026

Paper page - VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

…and recovers by treating a VLA / WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high…

Jun 10, 2026

Paper page - MinT: Managed Infrastructure for Training and Serving Millions of LLMs

…Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed…

May 14, 2026

Paper page - Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

…Zihao Han, Tiangang Zhang, … Abstract: Adaptive Teacher Exposure for Self-Distillation (ATESD) improves large language model reasoning by dynamically adjusting teacher exposure during training through a learnable policy controller…

May 14, 2026

Paper page - BraveGuard: From Open-World Threats to Safer Computer-Use Agents

…BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As…

Jun 4, 2026

Paper page - Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

…Published on May 8. Submitted by Wanli Yang on May 13. Abstract: Reinforcement learning improves large language model recall of parametric knowledge by redistributing…

May 13, 2026
