Can low-precision training match BF16 accuracy at scale?
To validate the practical impact of low-precision training on real-world large-model pretraining, the team evaluated both training convergence and downstream task performance across two widely used dense transformer architectures: Llama 3 8B and an NVIDIA internal research 8B model (Research-8B), which uses a dense grouped-query attention (GQA) architecture similar to that of Llama 3 8B. The models were trained on 1 trillion tokens.