Tag
The paper identifies repetition mismatch as a primary reason that data mixture experiments fail to scale, and proposes a repetition-controlled subsampling procedure that lets small-scale experiments recover near-optimal mixtures using far fewer tokens.
Explains mid-training as a stage between pre-training and post-training, in which a base model continues training on curated data to strengthen specific capabilities before instruction tuning.