Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
…Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German…