FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
…raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT…