Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
… On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity. …