Paper page - The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
… The following papers were recommended by the Semantic Scholar API:
General Preference Reinforcement Learning (2026)
RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains (2026)
Explaining and Preventing Alignment Collapse in Iterative RLHF (2026)
Wasser… …