Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision | NVIDIA Technical Blog
…Figure 1 shows the token multiplicative probability error metric of the three recipes.

Mitigating numerical disagreement with importance sampling

Importance sampling is used to correct the distribution mismatch between the model (i…
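To make the idea concrete, here is a minimal sketch of token-level importance sampling. It assumes that the two mismatched distributions are the log-probabilities the inference (rollout) engine reported for each sampled token and the log-probabilities the training engine recomputes for those same tokens. The function name `truncated_importance_weights` and the cap of 2.0 are hypothetical illustrations. Truncating large ratios is one common way to control variance and may not match the recipe described in the article.

```python
import math
from typing import List


def truncated_importance_weights(
    train_logprobs: List[float],
    rollout_logprobs: List[float],
    cap: float = 2.0,  # hypothetical truncation threshold, not taken from the article
) -> List[float]:
    """Per-token importance weights pi_train / pi_rollout, clipped at `cap`.

    Assumption: both lists hold log-probabilities of the same sampled tokens,
    one set recomputed by the trainer and one reported by the rollout engine.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    weights = []
    for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs):
        # Form the ratio from the log-prob difference, which avoids
        # dividing two very small probabilities.
        ratio = math.exp(lp_train - lp_rollout)
        # Clip large ratios so that a few tokens cannot dominate the update.
        weights.append(min(ratio, cap))
    return weights


if __name__ == "__main__":
    trainer = [-0.5, -0.1, -2.0]
    rollout = [-0.5, -5.0, -1.0]
    # The first token matches exactly, the second is clipped at the cap,
    # and the third is down-weighted.
    print(truncated_importance_weights(trainer, rollout))
```

In a policy-gradient loss, these weights would typically multiply each token's term and be treated as constants, so that no gradient flows through the rollout-side probabilities.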