T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
…Policy Optimization (T²PO) addresses multi-turn RL instability by controlling exploration at fine-grained levels through uncertainty monitoring and dynamic resampling. (AI-generated summary)
Recent progress in multi-turn reinforcement learning (RL…