How do pruning and distillation impact model performance?
Experimental results for pruning and distilling Qwen3-8B with Model Optimizer show that the depth-pruned Qwen3 6B model is 30% faster than the Qwen3-4B model while also scoring higher on the MMLU (Massive Multitask Language Understanding) benchmark. Depth pruning reduced the model from 36 to 24 layers, producing the 6B model, and was run on a single NVIDIA H100 80 GB HBM3 GPU. The pruned model was distilled from Qwen3-8B using the OptimalScale/ClimbMix data, which is processed from the nvidia/ClimbMix pretraining dataset; the experiment uses 25% of this data, approximately 90B tokens. Distillatio