Paper page - Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
…the decoupled projection also lets you mix forward vs reverse KL to get mode-seeking or mode-covering behavior without reworking the whole objective. one edge case i'd watch is what…
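
To make the forward/reverse KL point concrete, here is a minimal sketch over a small group of sampled responses. It is not the paper's objective: the function names (`kl`, `mixed_kl`), the mixing weight `alpha`, and the toy distributions are all assumptions made for illustration. It uses the usual convention that forward KL, KL(target || policy), is mode-covering and reverse KL, KL(policy || target), is mode-seeking.

```python
import math

def kl(a, b, eps=1e-12):
    """KL(a || b) for two discrete distributions over the same response group."""
    # Terms with a_i == 0 contribute nothing; b_i is clamped to avoid log(0).
    return sum(ai * math.log(ai / max(bi, eps)) for ai, bi in zip(a, b) if ai > 0)

def mixed_kl(target, policy, alpha):
    """Hypothetical mix: alpha=1 is pure forward KL, alpha=0 is pure reverse KL."""
    forward = kl(target, policy)   # mode-covering: penalizes missing target mass
    reverse = kl(policy, target)   # mode-seeking: penalizes mass where target is low
    return alpha * forward + (1.0 - alpha) * reverse

# Toy bimodal target over three responses (assumed values, not from the paper).
target = [0.49, 0.02, 0.49]
peaked = [0.96, 0.02, 0.02]   # policy committed to one mode
spread = [0.34, 0.32, 0.34]   # policy spread over all responses

for alpha in (1.0, 0.5, 0.0):
    print(alpha, mixed_kl(target, peaked, alpha), mixed_kl(target, spread, alpha))
```

With these toy numbers, pure forward KL scores the spread policy as closer to the target, while pure reverse KL scores the single-mode policy as closer; intermediate `alpha` values trade the two off.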