Efficient training of language models to fill in the middle
Summary
OpenAI presents a simple data augmentation technique that enables autoregressive language models to perform fill-in-the-middle (FIM) text generation without harming left-to-right performance, and provides extensive ablations and best practices for training such models.
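The summary does not spell out the augmentation itself. As a rough illustration only, the sketch below assumes a character-level variant in which a random middle span of a document is moved to the end, so that an ordinary left-to-right model learns to generate the middle after seeing the prefix and suffix. The sentinel strings `<PRE>`, `<SUF>`, `<MID>`, the `fim_rate` parameter, and both function names are placeholders chosen for this example, not identifiers taken from the paper.

```python
import random

# Placeholder sentinels (assumption); a real setup would reserve special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, move a random middle span of doc to the end."""
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # keep as an ordinary left-to-right training example
    # Two distinct cut points split the document into prefix, middle, suffix.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Prefix and suffix come first; the middle becomes the generation target.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def fim_restore(example: str) -> str:
    """Undo fim_transform; assumes the sentinels do not occur in the document."""
    if not example.startswith(PRE):
        return example
    prefix, rest = example[len(PRE):].split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix


rng = random.Random(0)
doc = "def add(a, b):\n    return a + b\n"
augmented = fim_transform(doc, rng, fim_rate=1.0)
```

Because untransformed documents pass through unchanged, a single training stream can mix both kinds of example, which is one plausible reading of how left-to-right performance is preserved.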
Similar Articles
Memorization Dynamics of Fill-in-the-Middle Pretraining
This paper studies how fill-in-the-middle (FIM) pretraining affects verbatim memorization, finding that FIM more often recovers short spans while standard left-to-right training recovers long exact continuations, and that memorization under FIM grows linearly with repetitions.
Extracting Training Data from Diffusion Language Models via Infilling
This paper introduces infilling extraction, a new method for extracting training data from diffusion language models by using arbitrary binary masks, showing that such models are more vulnerable to memorization attacks than previously thought.
Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining
This paper investigates training-time data augmentation techniques to mitigate overfitting in autoregressive language model pretraining under data-constrained, compute-abundant regimes, finding that combining token-level noise, sequence permutations, and target offset prediction improves validation loss.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
This paper proposes mid-training language models on self-generated diverse reasoning traces before reinforcement learning, showing improved RL performance on math benchmarks by exposing models to multiple valid solution approaches.