Efficient training of language models to fill in the middle

OpenAI Blog 07/28/22, 07:00 AM Papers

Summary

OpenAI presents a simple data augmentation technique that enables autoregressive language models to perform fill-in-the-middle (FIM) text generation without harming left-to-right performance, with extensive ablations and best practices provided for training such models.

Original Article

Cached at: 04/20/26, 02:55 PM

# Efficient training of language models to fill in the middle

Source: [https://openai.com/index/efficient-training-of-language-models-to-fill-in-the-middle/](https://openai.com/index/efficient-training-of-language-models-to-fill-in-the-middle/)

## Abstract

We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
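The transformation described in the abstract, moving a span from the middle of a document to its end, can be illustrated with a minimal Python sketch. Several details here are assumptions made for illustration and are not stated in the abstract: the sentinel strings `<PRE>`, `<SUF>`, and `<MID>`, the prefix-suffix-middle ordering, the uniform character-level choice of cut points, and the default `fim_rate` of 0.5. The helper names `fim_transform` and `fim_restore` are hypothetical.

```python
import random

# Hypothetical sentinel strings (assumption); a real tokenizer would
# reserve dedicated special tokens for these markers.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, move a random middle span of doc to the end.

    Documents that are not selected are returned unchanged, so they remain
    ordinary left-to-right training examples.
    """
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc
    # Two distinct cut points, chosen uniformly at random (assumption),
    # split the document into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # The middle span is placed last, so a left-to-right model learns to
    # generate it conditioned on both the prefix and the suffix.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def fim_restore(example: str) -> str:
    """Invert fim_transform on a transformed example (sanity check only).

    Assumes the original document does not itself contain the sentinels.
    """
    head, middle = example.split(MID, 1)
    prefix, suffix = head[len(PRE):].split(SUF, 1)
    return prefix + middle + suffix


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "def add(a, b):\n    return a + b\n"
    example = fim_transform(doc, rng, fim_rate=1.0)
    print(repr(example))
    print(fim_restore(example) == doc)
```

At inference time, under the same assumptions, a prompt of the form `<PRE>prefix<SUF>suffix<MID>` would ask the model to continue with the missing middle. The `fim_rate` parameter corresponds to the "data transformation frequency" the abstract lists among the ablated hyperparameters.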

Similar Articles

Memorization Dynamics of Fill-in-the-Middle Pretraining

Extracting Training Data from Diffusion Language Models via Infilling

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining