T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
…Geng

Abstract

AI-generated summary: Token- and Turn-level Policy Optimization (T²PO) addresses multi-turn RL instability by controlling exploration at fine-grained levels through uncertainty monitoring and dynamic resampling.

Recent progress…