RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
…horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across…