Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
…However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web…