KL for a KL: On-Policy Distillation with Control Variate Baseline
AI-generated summary: On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the…
