FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
…This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT…
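To make the dilution effect described above concrete, here is a minimal toy sketch in Python. It is not FocuSFT and is not taken from the paper. It assumes a single attention query, a fixed number of content tokens that all share one logit, and filler tokens that all share a lower logit; the function names and the logit values are hypothetical, chosen only for illustration. Under those assumptions, it shows how the softmax attention mass on the content tokens shrinks as the context grows.

```python
import math


def content_attention_mass(num_content, num_filler,
                           content_logit=2.0, filler_logit=0.0):
    """Fraction of softmax attention mass that lands on content tokens.

    Toy assumption (not from the paper): every content token has the same
    logit, and every filler token has the same, lower logit.
    """
    content = num_content * math.exp(content_logit)
    filler = num_filler * math.exp(filler_logit)
    return content / (content + filler)


def dilution_table(context_lengths, num_content=16):
    """Return (context_length, content_mass) pairs for a fixed content set."""
    rows = []
    for n in context_lengths:
        # Everything that is not a content token is treated as filler.
        rows.append((n, content_attention_mass(num_content, n - num_content)))
    return rows


if __name__ == "__main__":
    for n, mass in dilution_table([128, 1024, 8192, 65536]):
        print(f"context={n:>6}  content attention mass={mass:.4f}")
```

In this toy setting the content tokens keep the same logit advantage at every length, yet their share of the attention distribution still falls as filler tokens accumulate. Because the gradient of a softmax output with respect to a token's logit is scaled by that token's attention weight, a smaller share also means a smaller per-token gradient contribution, which is one way to read the "weakened gradient signal" the abstract refers to.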