Memorization Dynamics of Fill-in-the-Middle Pretraining

Summary

This paper studies how fill-in-the-middle (FIM) pretraining affects verbatim memorization. Under prefix-based probes, FIM more often recovers short or partially matching spans, while standard left-to-right (LTR) training more often assigns high confidence to long exact continuations. Verbatim extraction under FIM grows approximately linearly with repetitions over the tested range.

arXiv:2605.22981v1 Announce Type: new Abstract: Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM-training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM-training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.

# Memorization Dynamics of Fill-in-the-Middle Pretraining
Source: [https://arxiv.org/html/2605.22981](https://arxiv.org/html/2605.22981)
###### Abstract

Fill\-in\-the\-middle \(FIM\) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored\. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3\.2 models with FIM and standard left\-to\-right \(LTR\) objectives on a FineWeb\-Gutenberg corpus containing repeated Gutenberg excerpts\. With prefix\-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations\. We observe that verbatim extraction under FIM\-training grows approximately linearly with repetitions over the tested range\. Evaluating native FIM\-format probes reveals that suffix context is not sufficient: verbatim recall under FIM\-training remains strongly anchored in prefix context\. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior\.

memorization, fill\-in\-the\-middle, large language models, pretraining

## 1 Introduction

Large language models can reproduce training data, including rare strings, private information, code, and book passages \(Carlini et al\., [2019](https://arxiv.org/html/2605.22981#bib.bib10), [2021](https://arxiv.org/html/2605.22981#bib.bib11); Nasr et al\., [2025](https://arxiv.org/html/2605.22981#bib.bib14); Cooper et al\., [2026](https://arxiv.org/html/2605.22981#bib.bib16)\)\. Early work measured unintended memorization with synthetic canaries and exposure scores \(Carlini et al\., [2019](https://arxiv.org/html/2605.22981#bib.bib10)\); later attacks extracted real training examples \(Carlini et al\., [2021](https://arxiv.org/html/2605.22981#bib.bib11); Nasr et al\., [2025](https://arxiv.org/html/2605.22981#bib.bib14)\)\. Recent work studies leakage beyond greedy decoding, including probabilistic extraction \(Hayes et al\., [2025](https://arxiv.org/html/2605.22981#bib.bib17)\), book\-level extraction \(Cooper et al\., [2026](https://arxiv.org/html/2605.22981#bib.bib16)\), and membership\-style tests \(Mattern et al\., [2023](https://arxiv.org/html/2605.22981#bib.bib18); Shi et al\., [2024](https://arxiv.org/html/2605.22981#bib.bib19)\)\.

Repetition is one of the clearest predictors of memorization\. Deduplication reduces verbatim generations \(Lee et al\., [2022](https://arxiv.org/html/2605.22981#bib.bib20)\); duplicate count predicts regeneration \(Kandpal et al\., [2022](https://arxiv.org/html/2605.22981#bib.bib21)\); and controlled injections are recovered more often as exposure increases \(Huang et al\., [2024](https://arxiv.org/html/2605.22981#bib.bib22)\)\. Attribution remains difficult because prior predictability, near duplicates, tokenization, prompt position, and available context can all affect recovery \(Kharitonov et al\., [2021](https://arxiv.org/html/2605.22981#bib.bib23); Zhang et al\., [2023](https://arxiv.org/html/2605.22981#bib.bib24); Shilov et al\., [2026](https://arxiv.org/html/2605.22981#bib.bib25); Liu et al\., [2024](https://arxiv.org/html/2605.22981#bib.bib26); Xu et al\., [2026](https://arxiv.org/html/2605.22981#bib.bib13)\)\.

We study fill\-in\-the\-middle \(FIM\), a common pretraining objective for causal language models \(Bavarian et al\., [2022](https://arxiv.org/html/2605.22981#bib.bib27)\)\. Standard left\-to\-right \(LTR\) training predicts each token from its prefix\. FIM training moves a target middle span behind the prefix and suffix, separated by sentinel tokens, so that during training the target is conditioned on right context as well as left context\. Infilling is used in systems such as DeepSeek\-v3, InCoder, StarCoder, and Code Llama \(DeepSeek\-AI et al\., [2025](https://arxiv.org/html/2605.22981#bib.bib12); Fried et al\., [2023](https://arxiv.org/html/2605.22981#bib.bib28); Li et al\., [2023](https://arxiv.org/html/2605.22981#bib.bib29); Rozière et al\., [2023](https://arxiv.org/html/2605.22981#bib.bib30)\)\. Prior work has mainly emphasized infilling utility; here we ask how the objective impacts verbatim extraction\.

We conduct a controlled study comparing standard LTR and FIM pretraining under matched architecture and data source, asking three related questions:

1. \(i\) How does FIM impact verbatim memorization across target span lengths, extraction thresholds, and repetition?
2. \(ii\) Under native FIM prompting, how do prefix context, suffix context, and sentinel tokens contribute to verbatim memorization?
3. \(iii\) Are the observed effects specific to extraction geometry, or explained by broad model\-quality differences?

## 2 Study Design

We compare paired LTR and FIM models trained on the same data, architecture and parameters\. Our controlled conditions let us attribute differences in memorization to the pretraining format\.

### 2\.1 Matched Training with Controlled Repetition

The bulk corpus is FineWeb 100B, while our controlled memorization corpus consists of Project Gutenberg books \(Penedo et al\., [2024](https://arxiv.org/html/2605.22981#bib.bib31); Project Gutenberg, [n\.d\.](https://arxiv.org/html/2605.22981#bib.bib32)\)\. We score 4096\-token windows of Gutenberg books with a Llama 3\.2 model \(Llama Team, [2024](https://arxiv.org/html/2605.22981#bib.bib35)\) trained only on FineWeb, in order to filter out pre\-memorized, outlier, and duplicate windows\. The resulting cleaned set of excerpts is split into 12 repetition buckets of 2,810 excerpts with exposures from 1 to 128\. We balance bucket assignment by prior perplexity\.
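The balancing step can be illustrated with a small sketch. The source only states that bucket assignment is balanced by prior perplexity; the stratified scheme below (sort by perplexity, then distribute each group of neighbouring excerpts across the buckets) is an assumption, and `assign_buckets` is a hypothetical helper name.

```python
import random

def assign_buckets(perplexities, num_buckets, seed=0):
    """Return one bucket index per excerpt, balancing buckets by prior perplexity.

    Assumed scheme (not taken from the paper): sort excerpts by perplexity,
    cut the sorted order into strata of `num_buckets` neighbours, and shuffle
    each stratum across the buckets so every bucket sees the full range.
    """
    rng = random.Random(seed)
    order = sorted(range(len(perplexities)), key=lambda i: perplexities[i])
    assignment = [None] * len(perplexities)
    for start in range(0, len(order), num_buckets):
        stratum = order[start:start + num_buckets]
        buckets = list(range(num_buckets))
        rng.shuffle(buckets)
        for excerpt_idx, bucket in zip(stratum, buckets):
            assignment[excerpt_idx] = bucket
    return assignment
```

With an excerpt count divisible by the number of buckets, every bucket receives the same number of excerpts and one excerpt from each perplexity stratum.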

We build two corpora from the same data sources\. The LTR corpus keeps autoregressive order\. The FIM corpus rewrites examples into sentinel\-delimited prefix–suffix–middle order, where the spans are randomly partitioned\. In particular, repeated FIM copies use different split points, so repetition is document\-level exposure rather than fixed middle\-span exposure\. The FIM corpus contains $50\%$ FIM documents for FineWeb \(the rest being LTR\) and $100\%$ FIM documents for Gutenberg\.
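A minimal sketch of this rewrite follows. The sentinel strings and the exact layout (prefix sentinel, prefix, suffix sentinel, suffix, middle sentinel, middle) are assumptions in the spirit of Bavarian et al\.; the paper's tokenizer and sentinel placement may differ, and `to_psm` and `build_example` are hypothetical names.

```python
import random

# Hypothetical sentinel strings; real tokenizers define their own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_psm(tokens, rng):
    """Rewrite a token list into prefix-suffix-middle order at random split points."""
    # Two random cut points partition the document into prefix | middle | suffix.
    a, b = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:a], tokens[a:b], tokens[b:]
    return [FIM_PREFIX, *prefix, FIM_SUFFIX, *suffix, FIM_MIDDLE, *middle]

def build_example(tokens, fim_rate, rng):
    """Apply the FIM rewrite with probability `fim_rate`, else keep LTR order."""
    return to_psm(tokens, rng) if rng.random() < fim_rate else list(tokens)
```

Under these assumptions, `fim_rate=0.5` would correspond to the FineWeb portion and `fim_rate=1.0` to the Gutenberg portion. Because the cut points are redrawn on every call, repeated copies of one excerpt generally receive different decompositions.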

Both models use an identical Llama 3\.2 3B architecture and are trained over one epoch of $\approx 103$B tokens \($\approx 95\%/5\%$ FineWeb/Gutenberg\)\. Further experimental details are listed in [Appendix A](https://arxiv.org/html/2605.22981#A1) and model size is ablated in [Section B\.2](https://arxiv.org/html/2605.22981#A2.SS2)\.

### 2\.2 Downstream performance

We evaluate both models on 8 tasks of the LM Evaluation Harness \(Gao et al\., [2023](https://arxiv.org/html/2605.22981#bib.bib8)\) and observe that they achieve nearly identical performance\. Detailed metrics are provided in [Section B\.1](https://arxiv.org/html/2605.22981#A2.SS1)\. We conclude that, within our study, differences in memorization are not due to differences in model capabilities\.

## 3 Prefix\-only Extraction

We compare FIM and LTR with the same prefix\-only probe: using 100 prefix tokens to predict a span $z$ of $M=32$ target tokens\. For each repetition bucket, we probe both models on the same Gutenberg windows, sampling 10 disjoint windows per excerpt\.

We report two criteria\. First, inspired by Cooper et al\. \([2026](https://arxiv.org/html/2605.22981#bib.bib16)\), exact extraction computes $p_{z}=\prod_{i=1}^{M}q_{i}$, where $q_{i}$ is the top\-$k$\-renormalized probability of the $i$\-th target token under $k=40$, $T=1$\. A target is called *extractable* if $p_{z}\geq 0.1\%$\. Second, we generate $M$ tokens autoregressively starting from the prefix and report ROUGE\-L \(Lin, [2004](https://arxiv.org/html/2605.22981#bib.bib9)\), with ROUGE\-L $\geq 0.5$ indicating *high\-overlap recovery* following Chen et al\. \([2025](https://arxiv.org/html/2605.22981#bib.bib33)\)\. Using $M=32$ lets us evaluate both criteria on the same windows\. This is less strict per token than the $M=50$ setting in Cooper et al\. \([2026](https://arxiv.org/html/2605.22981#bib.bib16)\): the threshold corresponds to a per\-token geometric mean of $80.6\%$ for $M=32$ versus $87.1\%$ for $M=50$\. We vary $M$ in [Figure 3](https://arxiv.org/html/2605.22981#S3.F3)\.
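The exact-extraction criterion can be sketched in plain Python. This is an illustrative reading of the definition above, assuming teacher-forced logits are available for each target position; the function names are hypothetical and the paper's implementation may differ.

```python
import math

def topk_renormalized_prob(logits, target_id, k=40, temperature=1.0):
    """Probability of `target_id` after top-k truncation and renormalization."""
    scaled = [x / temperature for x in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    if target_id not in top:
        return 0.0  # target falls outside the top-k support
    m = max(scaled[i] for i in top)  # subtract the max for numerical stability
    weights = {i: math.exp(scaled[i] - m) for i in top}
    return weights[target_id] / sum(weights.values())

def span_probability(step_logits, target_ids, k=40, temperature=1.0):
    """p_z as the product of q_i over the target span."""
    p = 1.0
    for logits, tok in zip(step_logits, target_ids):
        p *= topk_renormalized_prob(logits, tok, k, temperature)
    return p

def is_extractable(p_z, threshold=1e-3):
    """Extractable if p_z is at least 0.1%."""
    return p_z >= threshold

def per_token_geometric_mean(threshold, span_length):
    """Per-token probability at which a span of uniform q_i meets the threshold."""
    return threshold ** (1.0 / span_length)
```

The last helper reproduces the comparison in the text: `per_token_geometric_mean(1e-3, 32)` is about 0.806 and `per_token_geometric_mean(1e-3, 50)` is about 0.871.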

![Refer to caption](https://arxiv.org/html/2605.22981v1/x1.png)\(a\) Verbatim extraction rate
![Refer to caption](https://arxiv.org/html/2605.22981v1/x2.png)\(b\) High\-overlap recovery rate

Figure 1: Memorization across repetition buckets\. For strict full\-span extraction, LTR is higher in aggregate, but FIM extracts more windows at the largest repetition bucket\. FIM yields stronger high\-overlap recovery for high repetitions\. FineWeb is the baseline trained only on FineWeb\. Shaded bands denote nominal 95% confidence intervals for the per\-window rate\.

For the exact extraction criterion, LTR overall memorizes more windows: 3,279 windows satisfy $p_{z}\geq 0.1\%$, versus 2,230 for FIM\. FIM is slightly higher on broader recovery measures, including mean ROUGE\-L \(0\.198 for FIM vs 0\.190 for LTR\) and mean top\-$k$ support rate \(87\.09% vs 86\.18%\), i\.e\., the fraction of reference tokens contained in the top\-$k$ logits with $k=40$\. The low memorization rate is partly due to probe position: beginning\-of\-excerpt probes elicit significantly more memorization than randomly sampled windows \([Figure 7](https://arxiv.org/html/2605.22981#A2.F7) of [Section B\.3](https://arxiv.org/html/2605.22981#A2.SS3)\)\.
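The two broader recovery measures can be sketched as follows. The ROUGE\-L variant here is a token\-level F1 over the longest common subsequence, which is an assumption; the source does not state which ROUGE\-L variant or tokenization it uses. All function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Token-level ROUGE-L F1 (assumed variant) between generation and reference."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(candidate), lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

def topk_support_rate(step_logits, target_ids, k=40):
    """Fraction of reference tokens that appear among the k highest logits."""
    hits = 0
    for logits, tok in zip(step_logits, target_ids):
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        hits += tok in top
    return hits / len(target_ids)
```

Support only asks whether each reference token survives top\-$k$ truncation, so it can stay high even when the product $p_{z}$ is small.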

While the FIM model’s support is higher, probability mass is less concentrated on complete 32\-token continuations\. The exact extraction criterion is strict, such that a few low\-probability tokens can collapse the $p_{z}$ of a target span\. A threshold sweep at repetition 128 confirms this: [Figure 2](https://arxiv.org/html/2605.22981#S3.F2) shows that FIM has more mass at moderate $p_{z}$, but LTR has the heavier tail, and therefore extracts more at the $0.1\%$ threshold\.
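The threshold sweep amounts to an empirical survival function over per\-window $p_{z}$ values. A minimal sketch, with `survival_curve` as a hypothetical helper name:

```python
def survival_curve(p_values, thresholds):
    """Percentage of windows with p_z >= t, for each threshold t."""
    n = len(p_values)
    return [100.0 * sum(p >= t for p in p_values) / n for t in thresholds]
```

Comparing two models then means evaluating both on the same thresholds: a model can lead at moderate thresholds and still trail at stricter ones if its distribution has the lighter high\-confidence tail.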

![Refer to caption](https://arxiv.org/html/2605.22981v1/x3.png)

Figure 2: Extraction survival curves at repetition 128 show that FIM assigns more mass to moderately likely targets, but LTR has the heavier high\-confidence tail\. Each line gives the percentage of evaluated target windows with $p_{z}\geq t$ as the extraction threshold $t$ varies\. The 95% confidence intervals are smaller than the line width\.

![Refer to caption](https://arxiv.org/html/2605.22981v1/x4.png)

Figure 3: Extraction rates under varying target lengths show that the number of repetitions required for FIM to overtake LTR increases with span length, because longer spans favor LTR’s heavier tail\. Curves show the fraction of windows with $p_{z}\geq 0.1\%$ for the first 20, 30, 40, and 50 target tokens; all panels use the same y\-axis scale\. Shaded bands denote nominal 95% confidence intervals for the per\-window rate\.

In line with Huang et al\. \([2024](https://arxiv.org/html/2605.22981#bib.bib22)\), we find that non\-trivial repetitions are required for memorization\. This is expected, especially at the 3B model scale, since memorization increases with model capacity \(Carlini et al\., [2023](https://arxiv.org/html/2605.22981#bib.bib15)\)\. We study a 1B ablation in [Section B\.2](https://arxiv.org/html/2605.22981#A2.SS2)\. With more repetitions, LTR extraction shows diminishing returns, consistent with the logarithmic trend reported in Carlini et al\. \([2023](https://arxiv.org/html/2605.22981#bib.bib15)\)\. While FIM extraction rises more steadily with repetitions, it remains low for small repetition counts\. We ablate the target length in [Figure 3](https://arxiv.org/html/2605.22981#S3.F3) and conclude that the number of repetitions required for FIM to surpass LTR in extraction increases with span length\. This is because a longer target makes extraction stricter, such that LTR’s heavy\-tailed distribution dominates\.
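Why longer targets are stricter can be seen by applying the same threshold to growing prefixes of the target span. The sketch below assumes per\-token probabilities $q_{i}$ are already computed for at least the longest length; the helper names are hypothetical.

```python
import math

def extractable_at_lengths(token_probs, lengths=(20, 30, 40, 50), threshold=1e-3):
    """For one window, test p_z >= threshold on the first M target tokens, per M."""
    result = {}
    for m in lengths:
        # Product of the first m per-token probabilities; assumes len(token_probs) >= m.
        result[m] = math.prod(token_probs[:m]) >= threshold
    return result

def extraction_rates(windows, lengths=(20, 30, 40, 50), threshold=1e-3):
    """Fraction of windows that are extractable at each target length."""
    flags = [extractable_at_lengths(w, lengths, threshold) for w in windows]
    return {m: sum(f[m] for f in flags) / len(flags) for m in lengths}
```

For example, a window with a uniform per\-token probability of 0.85 passes the threshold at 20, 30, and 40 tokens but fails at 50, whereas a window at 0.95 passes at all four lengths.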

We analyze attention patterns to further contextualize our insights\. For each target\-position prediction query, we partition the attention between \(i\) the prefix tokens and \(ii\) the already\-seen target tokens\. The latter is zero for the first target token of the target span and, for later positions, includes all earlier target tokens in the target span\. We average over target positions and windows and report the mean attention allocation in [Table 1](https://arxiv.org/html/2605.22981#S3.T1)\. The FIM model places more attention on the prefix and less on already\-seen target tokens compared to the LTR model\.
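The partition can be sketched for a single attention row. The sketch assumes each row already holds one query's attention weights, aggregated over heads and layers in some way the source does not specify; `attention_allocation` and `mean_allocation` are hypothetical names.

```python
def attention_allocation(attn_row, prefix_len, target_pos):
    """Split one query's attention into prefix mass and earlier-target mass.

    `attn_row` holds weights over positions 0..prefix_len+target_pos-1 for the
    query that predicts target token `target_pos` (0-based).
    """
    prefix_mass = sum(attn_row[:prefix_len])
    # Empty slice for target_pos == 0, so the first target token gets zero here.
    target_mass = sum(attn_row[prefix_len:prefix_len + target_pos])
    return prefix_mass, target_mass

def mean_allocation(rows, prefix_len):
    """Average the split over target positions; rows[t] is the query for target t."""
    splits = [attention_allocation(r, prefix_len, t) for t, r in enumerate(rows)]
    n = len(splits)
    return sum(s[0] for s in splits) / n, sum(s[1] for s in splits) / n
```

Averaging `mean_allocation` over windows would then give per\-model numbers of the kind reported in Table 1.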

Our observations can be explained by the structure of the FIM objective\. Repeated LTR examples present each passage under the same left\-to\-right view\. This concentrates probability mass into fewer long continuations, leading to the heavy\-tailed distribution with increased extraction\. Repeated FIM examples instead expose the same passage through varied prefix–middle–suffix decompositions, spreading mass across more partial reconstructions and broadening recoverability\.

Table 1: Mean attention allocation during prediction of the target span\. Both models rely primarily on the prefix, but FIM relies on it more strongly, while LTR allocates relatively more attention to earlier target tokens\. Nominal 95% confidence intervals are below $10^{-4}$\.

## 4 Native FIM probing

Since the native FIM format includes both left and right context, it fundamentally differs from the prefix\-only extraction prompt\. We study the FIM\-native format to evaluate how prefix and suffix context redistribute attention and contribute to memorization\. As before, we sample 10 disjoint windows for each excerpt and the target remains 32 tokens\. However, the 100\-token context is now split across prefix and suffix\. Additionally, we focus our analyses on the 128\-repetition bucket, in which memorization is most prevalent\. Note that this probing format includes the FIM sentinel tokens, so even if the suffix is empty, it still differs from the prefix prompt evaluated in [Section 3](https://arxiv.org/html/2605.22981#S3)\.

In [Figure 4](https://arxiv.org/html/2605.22981#S4.F4), we vary the prefix–suffix split around a fixed target to test which side of the native FIM context contributes more to memorization support\. As the prefix grows and the suffix shrinks, top\-$k$ support increases monotonically\. The same trend holds within all repetition buckets and for both extraction rates and target likelihood \(see [Section B\.3](https://arxiv.org/html/2605.22981#A2.SS3)\)\. In all repetition buckets, moving from suffix\-only to prefix\-only context, target perplexity falls from 60\.23 to 27\.93, while top\-$k$ support rises from 77\.60% to 85\.52%\. The sharp drop when little or no prefix is available reflects the autoregressive structure of causal language models: without left context, the model has no reliable starting point for generating the middle span\.
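A probe of this kind can be sketched as follows. The sentinel strings and their placement are assumptions carried over from a generic prefix–suffix–middle layout, not the paper's exact format, and `native_fim_probe` is a hypothetical helper.

```python
# Hypothetical sentinel strings; real tokenizers define their own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def native_fim_probe(tokens, target_start, target_len=32, prefix_len=50, budget=100):
    """Build a PSM-format prompt around a fixed target with a split context budget."""
    suffix_len = budget - prefix_len
    target_end = target_start + target_len
    if target_start < prefix_len or target_end + suffix_len > len(tokens):
        raise ValueError("window does not leave enough context on both sides")
    prefix = tokens[target_start - prefix_len:target_start]
    suffix = tokens[target_end:target_end + suffix_len]
    target = tokens[target_start:target_end]
    # The model is asked to continue after the middle sentinel with `target`.
    prompt = [FIM_PREFIX, *prefix, FIM_SUFFIX, *suffix, FIM_MIDDLE]
    return prompt, target
```

Sweeping `prefix_len` from 0 to 100 while holding `target_start` fixed yields the suffix\-only to prefix\-only range. The sentinels remain in the prompt even when one side is empty, which is why the prefix\-only end of the sweep still differs from the plain prefix prompt of Section 3.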

![Refer to caption](https://arxiv.org/html/2605.22981v1/x5.png)

Figure 4: Target\-token top\-$k$ support under native FIM geometry at 128 repetitions shows that memorization improves monotonically as more of the 100\-token context budget is allocated to the prefix rather than the suffix\. The x\-axis varies prefix/suffix lengths\. The line shows the percentage of target tokens included in the top\-$40$ support\. The 95% confidence intervals are smaller than the line width\.

While prefix\-heavy native FIM prompts elicit stronger memorization, the suffix still provides conditioning\. The attention analysis in [Figure 5](https://arxiv.org/html/2605.22981#S4.F5) shows substantial attention allocated to both prefix and suffix, with the prefix receiving slightly more attention\. For prompts with very little prefix, the model compensates by attending more heavily to preceding tokens of the target span\.

![Refer to caption](https://arxiv.org/html/2605.22981v1/x6.png)

Figure 5: Attention allocation under native FIM probing shows that the model uses both surrounding contexts, with more attention on the prefix than the suffix, and shifts attention toward earlier target tokens when little prefix is available\. The stacked areas show mean attention mass assigned to prefix tokens, suffix tokens, FIM sentinels, and earlier target tokens within the target span, averaged over target\-token prediction queries and repetition buckets\. The x\-axis varies prefix/suffix lengths\.

To isolate the contribution of prefix and suffix context directly, we keep the target fixed and replace the prefix, the suffix, or both with same\-length unrelated *distractor spans* from different Gutenberg excerpts\. We consider excerpts in the 128\-repetition bucket and vary the prefix–suffix ratio, keeping the total context budget fixed\. [Figure 6](https://arxiv.org/html/2605.22981#S4.F6) shows the top\-$k$ support in this setting\. We deduce that prefix and suffix do not contribute equally: recall is strongest when the available context is allocated to the prefix\. As expected, the full prompt yields the strongest top\-$k$ support across the sweep, serving as an upper\-bound reference for the distractor\-span conditions\. While replacing the suffix with a distractor reduces recall, replacing the prefix has a significantly larger effect\. When both sides are replaced by distractors, support drops sharply, confirming that the effect is not only due to prompt length or sentinel structure\.
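The four prompt conditions can be sketched as a small helper. The condition names and the choice of which part of the other excerpt to take are assumptions for illustration; the source only specifies same\-length spans from different Gutenberg excerpts.

```python
def apply_distractors(prefix, suffix, other_excerpt, condition):
    """Return (prefix, suffix) with the chosen side(s) swapped for distractor spans."""
    if len(other_excerpt) < len(prefix) + len(suffix):
        raise ValueError("distractor excerpt is too short")
    # Same-length spans taken from an unrelated excerpt (slice positions assumed).
    fake_prefix = other_excerpt[:len(prefix)]
    fake_suffix = other_excerpt[len(prefix):len(prefix) + len(suffix)]
    choices = {
        "full": (prefix, suffix),
        "distractor_prefix": (fake_prefix, suffix),
        "distractor_suffix": (prefix, fake_suffix),
        "distractor_both": (fake_prefix, fake_suffix),
    }
    return choices[condition]
```

Since every condition keeps the prompt length and sentinel layout unchanged, differences in top\-$k$ support between conditions can be attributed to the content of each side.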

![Refer to caption](https://arxiv.org/html/2605.22981v1/x7.png)

Figure 6: Target\-token top\-$k$ support under native FIM prompting at 128 repetitions and different distractor conditions\. Replacing the prefix harms recall more than replacing the suffix, confirming that prefix context is the stronger driver of memorization\. The x\-axis varies prefix/suffix lengths\. The 95% confidence intervals are smaller than the line width\.

## 5 Conclusion

Matched LTR and FIM models trained on a corpus containing repeated book excerpts show that the pretraining objective shapes how memorization accumulates\. Under prefix\-only probes, FIM improves short\-span and overlap\-based recovery, especially at high repetitions, while LTR produces more high\-confidence long exact continuations\.

Repetitions are not identical under the two objectives\. In LTR, repeated excerpts reinforce the same single left\-to\-right view of each excerpt, and extraction grows logarithmically before saturating\. In FIM, the same repeated excerpts appear in different prefix–middle–suffix decompositions\. This makes memorization slower at first, but it can exceed LTR on short\-span extraction at high repetition\. Native FIM probes further show that while suffixes help, a short true prefix is necessary for extraction\. Replacing the true prefix with a distractor prefix nearly suppresses memorization, while replacing the true suffix with a distractor suffix has a smaller effect\. Our results show that LTR and FIM expose different memorization profiles and that memorization in FIM remains strongly anchored to the prefix\.

### 5\.1 Limitations and Outlook

Since we pretrain from scratch, we are not able to study frontier\-scale models\. Repetition counts are bounded to 128, covering a practically relevant range, but do not allow extrapolation in the limit\. The main conceptual limitation is attribution: under random FIM decompositions, a probed span need not match a specific middle span seen during training, so the results do not allow us to trace exact exposures\. Beyond our random\-window probes, future work can investigate how prompt position impacts FIM and use span\-to\-training mappings to test whether the patterns persist across different probes and longer extraction windows\.

## Acknowledgements

We thank Yixuan Xu and Imanol Schlag for their guidance and feedback\. This work was supported as part of the Swiss AI Initiative by compute grant infra01 from the Swiss National Supercomputing Centre \(CSCS\) on Alps\.

## Impact Statement

This work studies how fill\-in\-the\-middle pretraining affects verbatim extraction of repeated text\. The results can inform data curation and memorization audits for models with infilling capability\. We do not introduce a new attack or release sensitive training examples\. The main risk is that better measurement may also help identify settings where extraction is easier; we view this as necessary for evaluating and reducing memorization in deployed systems\.

## References

- M\. Bavarian, H\. Jun, N\. A\. Tezak, J\. Schulman, C\. McLeavey, J\. Tworek, and M\. Chen \(2022\) Efficient training of language models to fill in the middle\. ArXiv abs/2207\.14255\. External Links: [Link](https://api.semanticscholar.org/CorpusID:251135268) Cited by: [§A\.2](https://arxiv.org/html/2605.22981#A1.SS2.p1.3), [§1](https://arxiv.org/html/2605.22981#S1.p3.1)\.
- Y\. Bisk, R\. Zellers, R\. L\. Bras, J\. Gao, and Y\. Choi \(2020\) PIQA: reasoning about physical commonsense in natural language\. In The Thirty\-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty\-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7\-12, 2020, pp\. 7432–7439\. External Links: [Link](https://doi.org/10.1609/aaai.v34i05.6239), [Document](https://dx.doi.org/10.1609/AAAI.V34I05.6239) Cited by: [Table 3](https://arxiv.org/html/2605.22981#A2.T3.10.8.2)\.
- N\. Carlini, D\. Ippolito, M\. Jagielski, K\. Lee, F\. Tramèr, and C\. Zhang \(2023\) Quantifying memorization across neural language models\. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023\. External Links: [Link](https://openreview.net/forum?id=TatRHT_1cK) Cited by: [§B\.2](https://arxiv.org/html/2605.22981#A2.SS2.p1.1), [§3](https://arxiv.org/html/2605.22981#S3.p5.1)\.
- N\. Carlini, C\. Liu, Ú\. Erlingsson, J\. Kos, and D\. Song \(2019\) The secret sharer: evaluating and testing unintended memorization in neural networks\. In 28th USENIX Security Symposium \(USENIX Security 19\), pp\. 267–284\. Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p1.1)\.
- N\. Carlini, F\. Tramèr, E\. Wallace, M\. Jagielski, A\. Herbert\-Voss, K\. Lee, A\. Roberts, T\. Brown, D\. Song, Ú\. Erlingsson, A\. Oprea, and C\. Raffel \(2021\) Extracting training data from large language models\. In 30th USENIX Security Symposium \(USENIX Security 21\), pp\. 2633–2650\. External Links: ISBN 978\-1\-939133\-24\-3, [Link](https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p1.1)\.
- T\. Chen, F\. Brahman, J\. Liu, N\. Mireshghallah, W\. Shi, P\. W\. Koh, L\. Zettlemoyer, and H\. Hajishirzi \(2025\) ParaPO: aligning language models to reduce verbatim reproduction of pre\-training data\. In Second Conference on Language Modeling\. External Links: [Link](https://openreview.net/forum?id=Uic3ojVhXh) Cited by: [§3](https://arxiv.org/html/2605.22981#S3.p2.13)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? try arc, the AI2 reasoning challenge\. CoRR abs/1803\.05457\. External Links: [Link](http://arxiv.org/abs/1803.05457), 1803\.05457 Cited by: [Table 3](https://arxiv.org/html/2605.22981#A2.T3.7.5.2), [Table 3](https://arxiv.org/html/2605.22981#A2.T3.8.6.2)\.
- A\. F\. Cooper, M\. A\. Lemley, A\. Casasola, A\. Ahmed, A\. Gokaslan, A\. B\. Cyphert, C\. D\. Sa, D\. E\. Ho, and P\. Liang \(2026\) Extracting memorized pieces of \(copyrighted\) books from open\-weight language models\. External Links: 2505\.12546, [Link](https://arxiv.org/abs/2505.12546) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p1.1), [§3](https://arxiv.org/html/2605.22981#S3.p2.13)\.
- DeepSeek\-AI, A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Zhang, H\. Ding, H\. Xin, H\. Gao, H\. Li, H\. Qu, J\. L\. Cai, J\. Liang, J\. Guo, J\. Ni, J\. Li, J\. Wang, J\. Chen, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, L\. Zhao, L\. Wang, L\. Zhang, M\. Li, M\. Wang, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, N\. Tian, P\. Huang, P\. Wang, P\. Zhang, Q\. Wang, Q\. Zhu, Q\. Chen, Q\. Du, R\. J\. Chen, R\. L\. Jin, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Xu, R\. Zhang, R\. Chen, S\. S\. Li, S\. Lu, S\. Zhou, S\. Chen, S\. Wu, S\. Ye, S\. Ye, S\. Ma, S\. Wang, S\. Zhou, S\. Yu, S\. Zhou, S\. Pan, T\. Wang, T\. Yun, T\. Pei, T\. Sun, W\. L\. Xiao, W\. Zeng, W\. Zhao, W\. An, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, X\. Q\. Li, X\. Jin, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Shen, X\. Chen, X\. Zhang, X\. Chen, X\. Nie, X\. Sun, X\. Wang, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Song, X\. Shan, X\. Zhou, X\. Yang, X\. Li, X\. Su, X\. Lin, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. X\. Zhu, Y\. Zhang, Y\. Xu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Li, Y\. Wang, Y\. Yu, Y\. Zheng, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Tang, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Wu, Y\. Ou, Y\. Zhu, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Zha, Y\. Xiong, Y\. Ma, Y\. Yan, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Huang, Z\. Zhang, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Xu, Z\. Wu, Z\. Zhang, Z\. Li, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Gao, and Z\. 
Pan \(2025\) DeepSeek\-v3 technical report\. External Links: 2412\.19437, [Link](https://arxiv.org/abs/2412.19437) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p3.1)\.
- D\. Fried, A\. Aghajanyan, J\. Lin, S\. Wang, E\. Wallace, F\. Shi, R\. Zhong, S\. Yih, L\. Zettlemoyer, and M\. Lewis \(2023\) InCoder: a generative model for code infilling and synthesis\. In The Eleventh International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=hQwb-lbM6EL) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p3.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2023\) A framework for few\-shot language model evaluation\. Zenodo\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.10256836), [Link](https://zenodo.org/records/10256836) Cited by: [§B\.1](https://arxiv.org/html/2605.22981#A2.SS1.p1.1), [§2\.2](https://arxiv.org/html/2605.22981#S2.SS2.p1.1)\.
- J\. Hayes, M\. Swanberg, H\. Chaudhari, I\. Yona, I\. Shumailov, M\. Nasr, C\. A\. Choquette\-Choo, K\. Lee, and A\. F\. Cooper \(2025\) Measuring memorization in language models via probabilistic extraction\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 9266–9291\. Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021\. External Links: [Link](https://openreview.net/forum?id=d7KBjmI3GmQ) Cited by: [Table 3](https://arxiv.org/html/2605.22981#A2.T3.5.3.2)\.
- J\. Huang, D\. Yang, and C\. Potts \(2024\) Demystifying verbatim memorization in large language models\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 10711–10732\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.598/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.598) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p2.1), [§3](https://arxiv.org/html/2605.22981#S3.p5.1)\.
- N\. Kandpal, E\. Wallace, and C\. Raffel \(2022\) Deduplicating training data mitigates privacy risks in language models\. In International Conference on Machine Learning, ICML 2022, 17\-23 July 2022, Baltimore, Maryland, USA, K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvári, G\. Niu, and S\. Sabato \(Eds\.\), Proceedings of Machine Learning Research, pp\. 10697–10707\. External Links: [Link](https://proceedings.mlr.press/v162/kandpal22a.html) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p2.1)\.
- E\. Kharitonov, M\. Baroni, and D\. Hupkes \(2021\) How BPE affects memorization in transformers\. CoRR abs/2110\.02782\. External Links: [Link](https://arxiv.org/abs/2110.02782), 2110\.02782 Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p2.1)\.
- K\. Lee, D\. Ippolito, A\. Nystrom, C\. Zhang, D\. Eck, C\. Callison\-Burch, and N\. Carlini \(2022\) Deduplicating training data makes language models better\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 8424–8445\. External Links: [Link](https://aclanthology.org/2022.acl-long.577/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.577) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p2.1)\.
- R\. Li, L\. B\. Allal, Y\. Zi, N\. Muennighoff, D\. Kocetkov, C\. Mou, M\. Marone, C\. Akiki, J\. Li, J\. Chim, Q\. Liu, E\. Zheltonozhskii, T\. Y\. Zhuo, T\. Wang, O\. Dehaene, J\. Lamy\-Poirier, J\. Monteiro, N\. Gontier, M\. Yee, L\. K\. Umapathi, J\. Zhu, B\. Lipkin, M\. Oblokulov, Z\. Wang, R\. Murthy, J\. T\. Stillerman, S\. S\. Patel, D\. Abulkhanov, M\. Zocca, M\. Dey, Z\. Zhang, U\. Bhattacharyya, W\. Yu, S\. Luccioni, P\. Villegas, F\. Zhdanov, T\. Lee, N\. Timor, J\. Ding, C\. S\. Schlesinger, H\. Schoelkopf, J\. Ebert, T\. Dao, M\. Mishra, A\. Gu, C\. J\. Anderson, B\. Dolan\-Gavitt, D\. Contractor, S\. Reddy, D\. Fried, D\. Bahdanau, Y\. Jernite, C\. M\. Ferrandis, S\. Hughes, T\. Wolf, A\. Guha, L\. V\. Werra, and H\. de Vries \(2023\) StarCoder: may the source be with you\!\. Transactions on Machine Learning Research\. Note: Reproducibility Certification\. External Links: ISSN 2835\-8856, [Link](https://openreview.net/forum?id=KoFOg41haE) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p3.1)\.
- C\. Lin \(2004\) ROUGE: a package for automatic evaluation of summaries\. In Text Summarization Branches Out, Barcelona, Spain, pp\. 74–81\. External Links: [Link](https://aclanthology.org/W04-1013/) Cited by: [§3](https://arxiv.org/html/2605.22981#S3.p2.13)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\) Lost in the middle: how language models use long contexts\. Transactions of the Association for Computational Linguistics 12, pp\. 157–173\. External Links: [Link](https://aclanthology.org/2024.tacl-1.9/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00638) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p2.1)\.
- Llama Team \(2024\) The llama 3 herd of models\. CoRR abs/2407\.21783\. External Links: [Link](https://doi.org/10.48550/arXiv.2407.21783), [Document](https://dx.doi.org/10.48550/ARXIV.2407.21783), 2407\.21783 Cited by: [§2\.1](https://arxiv.org/html/2605.22981#S2.SS1.p1.1)\.
- J\. Mattern, F\. Mireshghallah, Z\. Jin, B\. Schölkopf, M\. Sachan, and T\. Berg\-Kirkpatrick \(2023\) Membership inference attacks against language models via neighbourhood comparison\. In Findings of the Association for Computational Linguistics: ACL 2023, A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 11330–11343\. External Links: [Link](https://aclanthology.org/2023.findings-acl.719/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.719) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p1.1)\.
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2017\) Pointer sentinel mixture models\. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\-26, 2017, Conference Track Proceedings\. External Links: [Link](https://openreview.net/forum?id=Byj72udxe) Cited by: [Table 3](https://arxiv.org/html/2605.22981#A2.T3.12.10.2)\.
- M\. Nasr, J\. Rando, N\. Carlini, J\. Hayase, M\. Jagielski, A\. F\. Cooper, D\. Ippolito, C\. A\. Choquette\-Choo, F\. Tramèr, and K\. Lee \(2025\) Scalable extraction of training data from aligned, production language models\. In The Thirteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=vjel3nWP2a) Cited by: [§1](https://arxiv.org/html/2605.22981#S1.p1.1)\.
- G\. Penedo, H\. Kydlíček, L\. B\. allal, A\. Lozhkov, M\. Mitchell, C\. Raffel, L\. V\. Werra, and T\. Wolf \(2024\)The fineweb datasets: decanting the web for the finest text data at scale\.InThe Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=n6SCkn2QaG)Cited by:[§2\.1](https://arxiv.org/html/2605.22981#S2.SS1.p1.1)\.
- Project Gutenberg \(n\.d\.\)Project gutenberg\.Note:[https://www\.gutenberg\.org](https://www.gutenberg.org/)Accessed: 2026\-05\-04Cited by:[§2\.1](https://arxiv.org/html/2605.22981#S2.SS1.p1.1)\.
- B\. Rozière, J\. Gehring, F\. Gloeckle, S\. Sootla, I\. Gat, X\. Tan, Y\. Adi, J\. Liu, T\. Remez, J\. Rapin, A\. Kozhevnikov, I\. Evtimov, J\. Bitton, M\. P\. Bhatt, C\. C\. Ferrer, A\. Grattafiori, W\. Xiong, A\. D’efossez, J\. Copet, F\. Azhar, H\. Touvron, L\. Martin, N\. Usunier, T\. Scialom, and G\. Synnaeve \(2023\)Code llama: open foundation models for code\.ArXivabs/2308\.12950\.External Links:[Link](https://api.semanticscholar.org/CorpusID:261100919)Cited by:[§1](https://arxiv.org/html/2605.22981#S1.p3.1)\.
- K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2020\)WinoGrande: an adversarial winograd schema challenge at scale\.InThe Thirty\-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty\-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7\-12, 2020,pp\. 8732–8740\.External Links:[Link](https://doi.org/10.1609/aaai.v34i05.6399),[Document](https://dx.doi.org/10.1609/AAAI.V34I05.6399)Cited by:[Table 3](https://arxiv.org/html/2605.22981#A2.T3.11.9.2)\.
- W\. Shi, A\. Ajith, M\. Xia, Y\. Huang, D\. Liu, T\. Blevins, D\. Chen, and L\. Zettlemoyer \(2024\)Detecting pretraining data from large language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=zWqr3MQuNs)Cited by:[§1](https://arxiv.org/html/2605.22981#S1.p1.1)\.
- I\. Shilov, M\. Meeus, and Y\. de Montjoye \(2026\)The mosaic memory of large language models\.Nature Communications17\(1\)\.External Links:[Document](https://dx.doi.org/10.1038/s41467-026-68603-0),[Link](http://dx.doi.org/10.1038/s41467-026-68603-0),ISSN 2041\-1723Cited by:[§1](https://arxiv.org/html/2605.22981#S1.p2.1)\.
- A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\)CommonsenseQA: A question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL\-HLT 2019, Minneapolis, MN, USA, June 2\-7, 2019, Volume 1 \(Long and Short Papers\),J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\),pp\. 4149–4158\.External Links:[Link](https://doi.org/10.18653/v1/n19-1421),[Document](https://dx.doi.org/10.18653/V1/N19-1421)Cited by:[Table 3](https://arxiv.org/html/2605.22981#A2.T3.9.7.2)\.
- Y\. Xu, A\. Bosselut, and I\. Schlag \(2026\)Positional fragility in LLMs: how offset effects reshape our understanding of memorization risks\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=7dBPm5c5ue)Cited by:[Figure 7](https://arxiv.org/html/2605.22981#A2.F7),[Figure 7](https://arxiv.org/html/2605.22981#A2.F7.4.2),[§1](https://arxiv.org/html/2605.22981#S1.p2.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers,A\. Korhonen, D\. R\. Traum, and L\. Màrquez \(Eds\.\),pp\. 4791–4800\.External Links:[Link](https://doi.org/10.18653/v1/p19-1472),[Document](https://dx.doi.org/10.18653/V1/P19-1472)Cited by:[Table 3](https://arxiv.org/html/2605.22981#A2.T3.6.4.2)\.
- C\. Zhang, D\. Ippolito, K\. Lee, M\. Jagielski, F\. Tramèr, and N\. Carlini \(2023\)Counterfactual memorization in neural language models\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=67o9UQgTD0)Cited by:[§1](https://arxiv.org/html/2605.22981#S1.p2.1)\.

## Appendix A Experimental Details

### A.1 Training Parameters

Both paired models use the Llama 3.2 3B architecture implemented in Megatron-LM with packed sequences and FlashAttention. [Table 2](https://arxiv.org/html/2605.22981#A1.T2) gives the fixed backbone configuration.

Table 2: Backbone parameters. The LTR and FIM runs use the same tokenizer, architecture, precision, and context-length configuration.

Both paired runs use packed THD-format sequences, sequence length 16,384, micro-batch size 1, no dropout, and global batch size 2048 across 64 GH200 GPUs. One optimization step therefore consumes 33,554,432 tokens. The LTR run performs 3,057 updates over 102.58B tokens. The FIM run performs 3,064 updates over 102.81B tokens. We release training logs at [https://wandb.ai/memorization-study-fim-team/memorization-study-fim/](https://wandb.ai/memorization-study-fim-team/memorization-study-fim/) and model checkpoints on HuggingFace: [FIM 3B](https://huggingface.co/tvonarx/memfim-fim-3b), [LTR 3B](https://huggingface.co/tvonarx/memfim-ltr-3b), [FIM 1B](https://huggingface.co/tvonarx/memfim-fim-1b), [LTR 1B](https://huggingface.co/tvonarx/memfim-ltr-1b).
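The token counts in this paragraph follow directly from the batch configuration. The short check below reproduces them; the variable and function names are ours and purely illustrative.

```python
# Arithmetic check of the reported token budget (names are illustrative).
SEQ_LEN = 16_384       # tokens per packed sequence
GLOBAL_BATCH = 2_048   # sequences per optimization step

tokens_per_step = SEQ_LEN * GLOBAL_BATCH  # 33,554,432


def total_tokens(updates: int) -> int:
    """Tokens consumed after a given number of optimizer updates."""
    return updates * tokens_per_step


ltr_tokens = total_tokens(3_057)  # LTR run
fim_tokens = total_tokens(3_064)  # FIM run

print(f"tokens/step: {tokens_per_step:,}")
print(f"LTR: {ltr_tokens / 1e9:.2f}B, FIM: {fim_tokens / 1e9:.2f}B")
```

The seven extra updates in the FIM run account for the roughly 0.23B-token difference between the two budgets.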

### A.2 FIM Formatting

For each FIM document, following Bavarian et al. ([2022](https://arxiv.org/html/2605.22981#bib.bib27)), we randomly sample two split points within the document segment, yielding a prefix $\mathbf{P}$, middle span $\mathbf{M}$, and suffix $\mathbf{S}$. The FIM condition reuses reserved Llama special-token IDs, so we do not resize the embedding table.

$$\texttt{<|fim\_prefix|>} = 128002, \qquad \texttt{<|fim\_middle|>} = 128003, \qquad \texttt{<|fim\_suffix|>} = 128005.$$

The LTR format keeps the original order:

$$\mathbf{P}\;\mathbf{M}\;\mathbf{S}\;\texttt{<|eos\_token|>}$$

The FIM format moves the middle span after its surrounding context:

$$\texttt{<|fim\_prefix|>}\;\mathbf{P}\;\texttt{<|fim\_suffix|>}\;\mathbf{S}\;\texttt{<|fim\_middle|>}\;\mathbf{M}\;\texttt{<|eos\_token|>}$$
For FineWeb, FIM training uses a 50% FIM / 50% LTR mixture. For Gutenberg, every 4096-token excerpt is FIM-formatted. The LTR model is trained only on LTR sequences and never sees FIM sentinels.
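The two formats can be sketched at the token level as follows. This is a minimal illustration, not the training code: the sentinel IDs are the ones listed above, but the uniform sampling of split points over token positions, the function names, and the `eos_id` argument (whose value is not given in this appendix) are assumptions.

```python
import random

# Sentinel IDs as listed in this section.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = 128002, 128003, 128005


def split_document(tokens, rng):
    """Sample two split points (assumed uniform over token positions)."""
    a, b = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    return tokens[:a], tokens[a:b], tokens[b:]


def format_ltr(prefix, middle, suffix, eos_id):
    """Original order: P M S <eos>."""
    return prefix + middle + suffix + [eos_id]


def format_fim(prefix, middle, suffix, eos_id):
    """Middle span moved after its context: <pre> P <suf> S <mid> M <eos>."""
    return ([FIM_PREFIX] + prefix + [FIM_SUFFIX] + suffix
            + [FIM_MIDDLE] + middle + [eos_id])


def format_document(tokens, rng, fim_rate, eos_id):
    """Apply FIM with probability fim_rate, otherwise keep LTR order."""
    p, m, s = split_document(tokens, rng)
    if rng.random() < fim_rate:
        return format_fim(p, m, s, eos_id)
    return format_ltr(p, m, s, eos_id)


rng = random.Random(0)
doc = list(range(10))   # toy token IDs
EOS_PLACEHOLDER = -1    # hypothetical stand-in for the real EOS ID
print(format_document(doc, rng, fim_rate=0.5, eos_id=EOS_PLACEHOLDER))
```

Under these assumptions, `fim_rate=0.5` corresponds to the FineWeb mixture and `fim_rate=1.0` to the Gutenberg excerpts. A FIM-formatted document is always three tokens longer than its LTR counterpart.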

### A.3 Gutenberg Filtering and Deduplication

We filter Project Gutenberg to obtain fixed 4096-token excerpts whose later extraction can be attributed to controlled exposure rather than prior web familiarity. Starting from the English split of Project Gutenberg on HuggingFace (`manu/project_gutenberg`), we strip standard Gutenberg headers, footers, licenses, and archive boilerplate. From each cleaned book, we keep characters 10,000–80,000, tokenize with the Llama 3.2 tokenizer, split into non-overlapping 4096-token windows, score each window with the FineWeb-only Llama 3.2 3B checkpoint, and keep the highest-PPL window per book. We then remove windows with PPL > 500, which were mostly indices, glossary fragments, OCR artifacts, or unusual formatting.
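The per-book selection described above can be sketched as a single function. This is an illustrative outline only: `tokenize` and `perplexity` are hypothetical callables standing in for the Llama 3.2 tokenizer and the FineWeb-only checkpoint, and dropping a trailing partial window is our assumption.

```python
def select_excerpt(text, tokenize, perplexity, window=4096,
                   char_lo=10_000, char_hi=80_000, max_ppl=500.0):
    """Return the highest-PPL full window of a cleaned book, or None.

    tokenize:   callable str -> list of tokens (hypothetical stand-in)
    perplexity: callable list of tokens -> float (hypothetical stand-in)
    """
    tokens = tokenize(text[char_lo:char_hi])
    # Non-overlapping windows; a trailing partial window is dropped (assumption).
    windows = [tokens[i:i + window]
               for i in range(0, len(tokens) - window + 1, window)]
    if not windows:
        return None
    scored = [(perplexity(w), w) for w in windows]
    best_ppl, best = max(scored, key=lambda pair: pair[0])
    # The threshold is applied after per-book selection, as in the text.
    return best if best_ppl <= max_ppl else None


# Toy usage with whitespace tokens and a made-up score.
toy_ppl = lambda w: float(sum(len(t) for t in w))
print(select_excerpt("a bb ccc dddd e", str.split, toy_ppl,
                     window=2, char_lo=0, char_hi=100))
```

Because the threshold follows the per-book maximum, a book whose highest-PPL window exceeds 500 contributes no excerpt in this sketch.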

We deduplicate with both semantic and lexical evidence. Excerpts are embedded with `nomic-ai/nomic-embed-text-v1.5`; a pair is removed only if cosine similarity is at least 0.96 and token 5-gram Jaccard overlap is at least 0.20. For each duplicate cluster, we keep the highest-PPL excerpt. This reduces 128,003 scored windows to 33,720 final excerpts.
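The pairwise duplicate test combines the two thresholds conjunctively. A minimal pure-Python sketch of that rule is given below; the helper names are ours, embeddings are assumed to be plain lists of floats, and the clustering and highest-PPL selection steps are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ngrams(tokens, n=5):
    """Set of token n-grams of a sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_duplicate(emb_a, emb_b, tok_a, tok_b, cos_min=0.96, jac_min=0.20):
    """A pair counts as duplicate only if BOTH thresholds are met."""
    return (cosine(emb_a, emb_b) >= cos_min
            and jaccard(ngrams(tok_a), ngrams(tok_b)) >= jac_min)


# Toy example: identical embeddings, partially overlapping token windows.
print(is_duplicate([1.0, 0.0], [1.0, 0.0],
                   list(range(20)), list(range(5, 25))))
```

Requiring both signals means that two excerpts on the same topic with little shared wording, or two excerpts sharing boilerplate phrases but differing in content, are both kept.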

The final schedule has 12 repetition buckets with exposures 1, 2, 3, 4, 8, 16, 24, 32, 48, 64, 96, and 128. Each bucket contains 2,810 base excerpts, for 1,197,060 Gutenberg training documents after replication. Bucket assignment is balanced by FineWeb-checkpoint PPL; bucket means range from 36.895227 to 36.895758. The LTR-format Gutenberg corpus has 4,904,354,820 tokens, and the FIM-format Gutenberg corpus has 4,907,946,000 tokens, with the difference coming from FIM sentinel tokens.
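The document and token counts can be recovered from the schedule. In the check below, the per-document overhead (one end-of-sequence token for LTR, plus three sentinels for FIM) is our inference from the formats in A.2; it is consistent with the reported totals.

```python
# Arithmetic check of the Gutenberg schedule (names are illustrative).
EXPOSURES = [1, 2, 3, 4, 8, 16, 24, 32, 48, 64, 96, 128]
BASE_PER_BUCKET = 2_810
EXCERPT_TOKENS = 4_096

base_excerpts = len(EXPOSURES) * BASE_PER_BUCKET  # 33,720
documents = sum(EXPOSURES) * BASE_PER_BUCKET      # 1,197,060

# Inferred overhead: 1 EOS token per document, plus 3 FIM sentinels.
ltr_tokens = documents * (EXCERPT_TOKENS + 1)      # 4,904,354,820
fim_tokens = documents * (EXCERPT_TOKENS + 1 + 3)  # 4,907,946,000

print(f"{base_excerpts:,} excerpts -> {documents:,} documents")
print(f"LTR {ltr_tokens:,} vs FIM {fim_tokens:,} tokens")
```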

## Appendix B Additional Experimental Results

### B.1 Downstream Performance

We report the detailed metrics of our matched Llama 3.2 3B models for the LM Evaluation Harness suite (Gao et al., [2023](https://arxiv.org/html/2605.22981#bib.bib8)) in [Table 3](https://arxiv.org/html/2605.22981#A2.T3). The scores of the 1B-scale ablation in [Section B.2](https://arxiv.org/html/2605.22981#A2.SS2) are also reported.

Table 3: Downstream quality-control suite by model scale. Accuracy tasks are reported in %; higher is better. Lower is better for Wikitext word perplexity. Within each scale, green marks the better of LTR and FIM, and red marks the worse. Bold marks the best score in the row. $\Delta$ is FIM minus LTR within each scale (pp for accuracy, absolute for PPL).

### B.2 Model Size Ablation

Memorization has been shown to increase with model capacity (Carlini et al., [2023](https://arxiv.org/html/2605.22981#bib.bib15)). To test whether our conclusions generalize across scale, we train paired Llama 3.2 1B models under the same conditions.

As expected, both downstream performance and verbatim memorization decrease at the smaller scale. [Table 3](https://arxiv.org/html/2605.22981#A2.T3) reports the downstream comparison, and [Figure 7](https://arxiv.org/html/2605.22981#A2.F7) shows reduced memorization relative to the 3B variant. Because exact extraction on the random-window probes used in [Section 3](https://arxiv.org/html/2605.22981#S3) is too rare at 1B scale for a stable comparison, we focus on ROUGE-L instead.

The relative trends between LTR and FIM remain consistent with our main results in [Section 3](https://arxiv.org/html/2605.22981#S3).
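ROUGE-L gives partial credit for near-verbatim continuations because it is based on the longest common subsequence (LCS) between the generated and reference tokens. The sketch below computes a balanced F-measure over token sequences; the choice of F1 and the function names are assumptions, and the exact variant used for the reported numbers may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """LCS-based F1 between a reference and a generated token sequence."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# One substituted token out of four still yields a high score.
print(rouge_l_f1([1, 2, 3, 4], [1, 9, 3, 4]))
```

An exact-match criterion would score the example above as a miss, which is why a subsequence-based score stays informative when exact extraction is rare.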

### B.3 Additional Figures

[Figures 7](https://arxiv.org/html/2605.22981#A2.F7), [8](https://arxiv.org/html/2605.22981#A2.F8) and [9](https://arxiv.org/html/2605.22981#A2.F9) show additional figures omitted from the main text.

![Refer to caption](https://arxiv.org/html/2605.22981v1/x8.png)

![Refer to caption](https://arxiv.org/html/2605.22981v1/x9.png)

Figure 7: Mean ROUGE-L under prefix probing for 1B and 3B models, evaluated on 10 uniformly sampled windows per excerpt (left) and on the first window of each excerpt (right). Each prompt uses 100 prefix tokens to generate a 32-token continuation. Filled circles denote 3B models; hollow squares denote 1B models. The large gap between first-window and uniformly sampled-window probing indicates that recall is anchored near the beginning of repeated excerpts, consistent with the positional fragility observed by Xu et al. ([2026](https://arxiv.org/html/2605.22981#bib.bib13)).

![Refer to caption](https://arxiv.org/html/2605.22981v1/x10.png) (a) Mean per-token Cooper probability.
![Refer to caption](https://arxiv.org/html/2605.22981v1/x11.png) (b) Target tokens in top-$k$.

Figure 8: Native FIM geometry by repetition bucket. Heatmaps separate the prefix–suffix effect across repetition levels. The x-axis varies prefix/suffix lengths.

![Refer to caption](https://arxiv.org/html/2605.22981v1/x12.png) (a) Extractability.
![Refer to caption](https://arxiv.org/html/2605.22981v1/x13.png) (b) Target tokens in top-$k$.
![Refer to caption](https://arxiv.org/html/2605.22981v1/x14.png) (c) Mean per-token top-$k$ renormalized log-probability.
![Refer to caption](https://arxiv.org/html/2605.22981v1/x15.png) (d) Teacher-forced target NLL.

Figure 9: Native FIM probing across prefix–suffix geometry. Metrics are over all repetition buckets. The x-axis varies prefix/suffix lengths. Shaded bands are nominal 95% confidence intervals.

### B.4 Qualitative Assessment of Memorization

[Figures 10](https://arxiv.org/html/2605.22981#A2.F10), [11](https://arxiv.org/html/2605.22981#A2.F11) and [12](https://arxiv.org/html/2605.22981#A2.F12) show examples of memorized windows that are extractable by both models ([Figure 10](https://arxiv.org/html/2605.22981#A2.F10)), only extractable by the FIM model ([Figure 11](https://arxiv.org/html/2605.22981#A2.F11)), and only extractable by the LTR model ([Figure 12](https://arxiv.org/html/2605.22981#A2.F12)).

![Refer to caption](https://arxiv.org/html/2605.22981v1/x16.png) (a) LTR
![Refer to caption](https://arxiv.org/html/2605.22981v1/x17.png) (b) FIM

Figure 10: Window that was extracted by both models. Numbers indicate the top-$k$ re-normalized logits of the displayed true target tokens. Repetition 128; source book 54068-0; excerpt 54068-0::window_0000; target start 100; prefix length 100 tokens; target length 32 tokens; $p_z$ values: LTR=0.711046, FIM=0.585069.

![Refer to caption](https://arxiv.org/html/2605.22981v1/x18.png) (a) LTR
![Refer to caption](https://arxiv.org/html/2605.22981v1/x19.png) (b) FIM

Figure 11: Window that was only extracted by the FIM model. Numbers indicate the top-$k$ re-normalized logits of the displayed true target tokens. Repetition 128; source book 57335-0; excerpt 57335-0::window_0002; target start 100; prefix length 100 tokens; target length 32 tokens; $p_z$ values: LTR=0.000219776, FIM=0.204912.

![Refer to caption](https://arxiv.org/html/2605.22981v1/x20.png) (a) LTR
![Refer to caption](https://arxiv.org/html/2605.22981v1/x21.png) (b) FIM

Figure 12: Window that was only extracted by the LTR model. Numbers indicate the top-$k$ re-normalized logits of the displayed true target tokens. Repetition 128; source book 11326-8; excerpt 11326-8::window_0003; target start 100; prefix length 100 tokens; target length 32 tokens; $p_z$ values: LTR=0.588202, FIM=0.00063823.
