Search: AI training and model updates

Paper page - Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

…training methods for large language models are analyzed through a unified framework that decomposes rollout processes into generation, filtering, control, and replay stages, enabling systematic evaluation and improvement across reasoning tasks. AI…

May 6, 2026

Open R1: Update #2

…a single H100 (and based on Update #2, the model fits into 8xH100), how do you measure the throughput of H100? maybe 15*8 = 120 by 8xH100? · The model actually fits on…

Feb 6, 2025

Paper page - SkillOS: Learning Skill Curation for Self-Evolving Agents

…To provide learning signals for curation, we design composite rewards and train on grouped task streams based on skill-relevant task dependencies, where earlier trajectories update the SkillRepo, and later related tasks…

May 8, 2026

Paper page - Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

…Han Zhou, … Abstract: Reasoning Arena improves reinforcement learning with verifiable rewards by using trace tournaments and Bradley-Terry models to generate meaningful gradients from non-diverse reward groups, resulting in faster training…

Jun 9, 2026

Paper page - Dynamic Latent Routing

…Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters…

May 15, 2026

Paper page - Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

…ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make…

May 14, 2026

Paper page - RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

…AI-generated summary: Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground…

May 13, 2026

SmolLM3: smol, multilingual, long-context reasoner

…During pre-training phase 3 (reasoning datasets), what was used as the input? Were the user prompt + model response concatenated together and used as the input? Was a chat template added to…

Sep 10, 2025 · Elie Bakouch

Paper page - Stream-T1: Test-Time Scaling for Streaming Video Generation

…AI-generated summary: While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video generation methods based on diffusion models…

May 7, 2026

Paper page - Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

…beyond their training length. Extensive ablation studies validate the contribution of each component and provide guidance for practical configuration. Our code is publicly available at https://github.com/Musubi-ai/Mela…

May 12, 2026
