Search

Showing top 10 results for "policy and regulation"

Paper page - Reinforcing Multimodal Reasoning Against Visual Degradation

… For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically inco… …

May 12, 2026

Paper page - Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control

… We introduce the best entropy curve and the trick to achieve it for any policy-gradient method. …

May 12, 2026

Paper page - Flow-OPD: On-Policy Distillation for Flow Matching Models

… The following papers were recommended by the Semantic Scholar API: $R \text{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation (2026); S-GRPO: Unified Post-Training for Large Vision-Language Models (2026); OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models (2026); … …

May 11, 2026

Paper page - Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

… The following papers were recommended by the Semantic Scholar API: From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning (2026); Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning (2026); LaST-R1: Reinforcing Ac… …

May 4, 2026

Paper page - Recovering Hidden Reward in Diffusion-Based Policies

… The following papers were recommended by the Semantic Scholar API: Flow Matching Policy with Entropy Regularization (2026); ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching (2026); Truncated Rectified Flow Policy for Reinforcement Learning with One-Step … …

May 8, 2026

Paper page - Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

… We introduce Policy Optimization with Internal State Value Estimation, which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. …

May 13, 2026

Paper page - Self-Distilled Agentic Reinforcement Learning

… The following papers were recommended by the Semantic Scholar API: Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents (2026); Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing (2026); Revisiting On-Policy Distillation: Empirical Failure Modes and Sim… …

May 18, 2026

Paper page - How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

… The following papers were recommended by the Semantic Scholar API: Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent (2026); HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation (2026); PAI… …

May 6, 2026

Paper page - Healthcare AI GYM for Medical Agents

… To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework where a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. …

May 6, 2026

Paper page - ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

… The following papers were recommended by the Semantic Scholar API: AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning (2026); Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation (2026); CLIPO: Contrastive Learning in Po… …

May 5, 2026

