Tag
This paper studies how fill-in-the-middle (FIM) pretraining affects verbatim memorization, finding that FIM more often recovers short spans while standard left-to-right training recovers long exact continuations, and that memorization under FIM grows linearly with repetitions.
OpenAI presents a simple data augmentation technique that enables autoregressive language models to perform fill-in-the-middle (FIM) text generation without harming left-to-right performance, and provides extensive ablations and best practices for training such models.