Emergent retokenization symmetry in large language models: phenomenology and applications
Summary
This paper discovers that large language models partially exhibit emergent symmetry under retokenization—replacing a prompt's canonical tokenization with an alternative valid segmentation while preserving bytes exactly. The authors use this phenomenon to probe compositional understanding and propose retokenization as a novel inference-time sampling strategy that can recover solutions not found by conventional temperature sampling.
# Emergent retokenization symmetry in large language models: phenomenology and applications
Source: [https://arxiv.org/html/2606.15521](https://arxiv.org/html/2606.15521)
Kanishk Jain∗, Matthew Day, Tankut Can∗. Department of Physics, Emory University, Atlanta, GA 30322. ∗Correspondence: {kanishk.jain, tcan}@emory.edu
###### Abstract
Tokenization introduces representational redundancy: under a fixed token vocabulary, every byte string admits many valid token encodings, or segmentations, that decode to the same surface string. However, given a prompt, most language model tokenizers break this representational symmetry by returning a canonical segmentation. Training only on canonical segmentations should influence model behavior at inference, and there is little reason to expect the model to respect segmentation symmetry on downstream tasks. Nevertheless, we find that this symmetry partially emerges during training. Here, we probe this emergent symmetry through a series of experiments testing token compositional understanding, representation diversity, and task-focused benchmark performance. We primarily use *retokenization*: replacing a prompt’s canonical tokenization with an alternative valid segmentation while preserving its bytes exactly. Relative to other prompt perturbations, retokenization is unusually clean because it isolates segmentation effects without changing syntax, semantics, or surface form. We use retokenization to study sensitivity and robustness to semantically identical input representations across pretraining and post-training. Moreover, this partial retokenization symmetry suggests a distinct inference-time sampling axis. While temperature sampling generates diverse outputs from the model's next-token probability distribution, retokenization generates diversity from the model’s internal computations through semantically equivalent input representations. We find that while this retokenization sampling strategy can hurt performance on easy problems, it can also recover solutions that conventional sampling does not find. Overall, our work presents retokenization as a simple yet powerful probe of large language models, shedding light on compositional understanding and prompt sensitivity, and offering a novel sampling strategy.
## 1 Introduction
Tokenization is usually discussed as a source of brittleness in large language models (LLMs): grouping multiple characters into a single token obscures character structure and can produce failures on spelling-sensitive or character-level tasks (Singh and Strouse, [2024](https://arxiv.org/html/2606.15521#bib.bib48); Edman et al., [2024](https://arxiv.org/html/2606.15521#bib.bib12); Cosma et al., [2025](https://arxiv.org/html/2606.15521#bib.bib10); Chai et al., [2024](https://arxiv.org/html/2606.15521#bib.bib6)). The naive picture often called upon to explain these phenomena is that learning from token sequences does not necessarily produce a full understanding of the constituent structure of tokens. But this picture is incomplete, as recent work shows that LLMs can often interpret alternative segmentations of the same string and may internally recover whole-word representations from subword sequences (Zheng et al., [2025](https://arxiv.org/html/2606.15521#bib.bib58); Kaplan et al., [2025](https://arxiv.org/html/2606.15521#bib.bib23)), suggesting an emergent form of compositional understanding across token boundaries. Nevertheless, residual differences remain between different segmentations, affecting computation and performance (Geh et al., [2025](https://arxiv.org/html/2606.15521#bib.bib16)).
With a fixed vocabulary, the model’s *canonical tokenizer* returns only one valid encoding of a given byte string, using either dynamic programming or a greedy algorithm. However, given a *fixed vocabulary of tokens*, many alternate *segmentations* can be created, each represented by a different sequence of tokens that decodes to the same surface string (Figure [2](https://arxiv.org/html/2606.15521#S3.F2)B). We call the alternative non-canonical segmentations *retokenizations*. For instance, the token `probable` can be represented as the two-token sequence `prob`, `able` if the tokens `prob` and `able` are present in the model’s existing vocabulary. Using the canonical tokenizer during training constrains an LLM’s ability to learn such compositions of tokens. This blindness to subtoken structure has been a source of embarrassment in the past for frontier models, which could perform complicated mathematical and coding tasks but seemingly could not count the letters in a word (Cosma et al., [2025](https://arxiv.org/html/2606.15521#bib.bib10); Chai et al., [2024](https://arxiv.org/html/2606.15521#bib.bib6)).
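To make the multiplicity of segmentations concrete, the following is a minimal sketch that enumerates every valid segmentation of a string under a toy vocabulary. The function name `all_segmentations` and the vocabulary `toy_vocab` are hypothetical and chosen for illustration; the sketch works on characters rather than bytes and is not the tokenizer used in the paper.

```python
from functools import lru_cache


def all_segmentations(text, vocab):
    """Return every way to split `text` into pieces that all belong to `vocab`."""

    @lru_cache(maxsize=None)
    def from_index(start):
        if start == len(text):
            return [[]]  # exactly one way to segment the empty suffix
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                # Extend every segmentation of the remaining suffix.
                for rest in from_index(end):
                    results.append([piece] + rest)
        return results

    return from_index(0)


# Hypothetical toy vocabulary (an assumption, not a real model vocabulary).
toy_vocab = {"probable", "prob", "able", "pro", "b", "a", "ble"}
segmentations = all_segmentations("probable", toy_vocab)
for seg in segmentations:
    print(seg)
```

Every returned list concatenates back to the original string, which is the defining property of a retokenization; a canonical tokenizer would return only one of these lists.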
If training on canonical segmentations produced no invariance at all, non-canonical segmentations would simply corrupt the prompt. If invariance were exact, all equivalent segmentations would induce the same computation and retokenization would be behaviorally trivial. The empirical regime of interest is the intermediate one: trained models may exhibit approximate *segmentation symmetry*, meaning they still understand non-canonical segmentations (Kaushal and Mahowald, [2022](https://arxiv.org/html/2606.15521#bib.bib24); Itzhak and Levy, [2022](https://arxiv.org/html/2606.15521#bib.bib22); Edman et al., [2024](https://arxiv.org/html/2606.15521#bib.bib12)), together with residual *segmentation sensitivity*, meaning alternative segmentations of the same prompt produce different internal states and outputs (Zheng et al., [2025](https://arxiv.org/html/2606.15521#bib.bib58); Cosma et al., [2025](https://arxiv.org/html/2606.15521#bib.bib10); Edman et al., [2024](https://arxiv.org/html/2606.15521#bib.bib12)).
To probe this symmetry–sensitivity tradeoff, we use retokenization as a controlled inference-time intervention. Given a fixed vocabulary and byte-level prompt $s$, let $E^{0}(s)$ be the canonical token encoding and let $E^{\mu}(s)$ be any other valid segmentation of $s$ under the same vocabulary. Retokenization never changes the prompt bytes, introduces no new tokens, and does not define a new task. It changes only how the same string is decomposed into *existing* token vocabulary items. We use this perturbation in two ways throughout the paper. First, we use it mechanistically, to study hidden-state changes under semantically identical inputs. Second, we use it behaviorally through *Retokenization Sampling*, which samples segmentations $\{E^{\mu}(s)\}_{\mu=1}^{k}$, decodes greedily from each one, and summarizes the resulting success curve with $\mathrm{pass@retok}(k)$, the analogue of $\mathrm{pass@k}$ in which diversity comes from input representation rather than output randomness.
This setup is intentionally narrower than generic prompt perturbation. Paraphrasing, synonym substitution, reframing, masking, and reformatting change the surface string and therefore confound semantics with robustness to wording (Liu et al., [2023](https://arxiv.org/html/2606.15521#bib.bib29); Qiang et al., [2024](https://arxiv.org/html/2606.15521#bib.bib39); Bsharat and Shen, [2025](https://arxiv.org/html/2606.15521#bib.bib5); Dang et al., [2025](https://arxiv.org/html/2606.15521#bib.bib11)). Retokenization, by contrast, changes only the token-level representation of a fixed byte string. This makes it a particularly clean probe of whether tokenized models process equivalent inputs in a segmentation-invariant way, and where that invariance breaks down. Across the HumanEval (Chen et al., [2021](https://arxiv.org/html/2606.15521#bib.bib7)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.15521#bib.bib9)), GSM8K Python (Chowdhery et al., [2023](https://arxiv.org/html/2606.15521#bib.bib8)), and MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2606.15521#bib.bib20)) benchmarks, we find that equivalent segmentations usually preserve task identity and generate nontrivial diversity in outputs. Moreover, across OLMo-2 pretraining and post-training, we find that byte-level understanding, necessary for segmentation symmetry, is gradually learned rather than present trivially.
Our approach is distinct from prior methods that use stochastic tokenization during training, where subword regularization encourages robustness by exposing the model to multiple segmentations (Kudo, [2018](https://arxiv.org/html/2606.15521#bib.bib25); Provilkov et al., [2020](https://arxiv.org/html/2606.15521#bib.bib38)). We mainly focus on understanding inference-time behavior in models that were not explicitly trained for segmentation invariance. In that setting, any useful diversity produced by equivalent segmentations is an emergent property of the trained model rather than a built-in invariance constraint.
This approach also connects tokenization to recent views of reasoning as trajectory selection under finite compute (Snell et al., [2024](https://arxiv.org/html/2606.15521#bib.bib50); Merrill and Sabharwal, [2024](https://arxiv.org/html/2606.15521#bib.bib31); Pfau et al., [2024](https://arxiv.org/html/2606.15521#bib.bib36)). Transformers compute token by token, so changing segmentation changes the granularity and ordering of computation within the model even when the prompt string is unchanged. In that sense, retokenization is a controlled method of altering the model’s internal processing without changing the prompt itself. The interesting question is therefore not just whether alternative segmentations help or hurt benchmark performance, but what they reveal about how invariance and sensitivity coexist inside tokenized models.
This question is timely because tokenization research increasingly treats tokenizers as more than compression devices (Bostrom and Durrett, [2020](https://arxiv.org/html/2606.15521#bib.bib4); Zouhar et al., [2023](https://arxiv.org/html/2606.15521#bib.bib61); Schmidt et al., [2024](https://arxiv.org/html/2606.15521#bib.bib43); Ali et al., [2024](https://arxiv.org/html/2606.15521#bib.bib2); Haslett, [2025](https://arxiv.org/html/2606.15521#bib.bib19)), while byte-level and tokenizer-free models aim to remove the dependency altogether (Xue et al., [2022](https://arxiv.org/html/2606.15521#bib.bib54); Yu et al., [2023](https://arxiv.org/html/2606.15521#bib.bib57); Wang et al., [2024](https://arxiv.org/html/2606.15521#bib.bib53); Pagnoni et al., [2025](https://arxiv.org/html/2606.15521#bib.bib34)). Our work addresses the complementary regime: as long as deployed LLMs remain tokenized, the multiplicity of valid segmentations is itself a mechanistically meaningful degree of freedom. At the very least, it cannot be ignored.
The rest of the paper develops this argument in three steps. We first ask whether segmentation-level compositional understanding emerges during training, using hidden-state analyses. We then measure the behavioral consequences of residual segmentation sensitivity across four benchmarks and multiple training stages. Finally, we compare the diversity induced by retokenization with that of conventional temperature sampling.
### 1.1 Our Contributions
We make the following contributions:
1. **Retokenization as a controlled probe of segmentation invariance.** We formalize retokenization as a semantics-preserving intervention that changes only the token-level representation of a fixed byte string. This gives a clean way to probe how tokenized models respond to equivalent inputs without changing wording, syntax, or task definition.
2. **Internal representations show emergent segmentation symmetry and residual sensitivity.** Using hidden-state analyses, we show that segmentation-induced variation is suppressed in early and middle layers but is accentuated in the final layer.
3. **A concrete evaluation protocol.** We formalize the symmetry–sensitivity tradeoff with $\mathrm{pass@retok}(k)$, the direct analogue of $\mathrm{pass@k}$ in which diversity comes from input representation rather than output randomness. This provides a concrete way to study how segmentation invariance and sensitivity coexist in tokenized language models.
4. **Task-dependent gains and robustness trends.** Across GSM8K, GSM8K Python, and HumanEval, we show that $\mathrm{pass@retok}(k)$ rises with $k$ and recovers instances missed by canonical decoding. We also show that across post-training stages in OLMo-2 and other model families, stronger models tend to remain stronger under retokenization. In other words, retokenization symmetry tends to track model performance.
5. **Retokenization recovers structurally diverse correct programs.** On HumanEval, AST-based syntactic-diversity measurements show that retokenization does not merely reproduce a single canonical solution template. Correct retokenized generations span a substantial range of program structures, and some tasks are solved only by *Retokenization Sampling*.
## 2 Related Work
Subword tokenization schemes such as BPE, WordPiece, and SentencePiece were introduced to balance open-vocabulary coverage with computational efficiency (Sennrich et al., [2016](https://arxiv.org/html/2606.15521#bib.bib46); Kudo and Richardson, [2018](https://arxiv.org/html/2606.15521#bib.bib26); Schuster and Nakajima, [2012](https://arxiv.org/html/2606.15521#bib.bib44); Radford et al., [2019](https://arxiv.org/html/2606.15521#bib.bib40)). But tokenization is not just a compression step applied before modeling begins, as the choice of vocabulary and token boundaries changes how text is presented to the model, which features are localized within single tokens versus split across multiple tokens, and how much sequence length is required to express the same input. Prior work shows that these design choices affect pretraining efficiency, downstream performance, and the kinds of linguistic structure models capture reliably (Bostrom and Durrett, [2020](https://arxiv.org/html/2606.15521#bib.bib4); Ali et al., [2024](https://arxiv.org/html/2606.15521#bib.bib2); Zouhar et al., [2023](https://arxiv.org/html/2606.15521#bib.bib61); Schmidt et al., [2024](https://arxiv.org/html/2606.15521#bib.bib43)). Tokenization is therefore an important part of building language models and cannot be treated as a detail that is irrelevant to model behavior.
In modern LLMs, prior work has found partial but incomplete character-level and sub-token-level awareness (Kaushal and Mahowald, [2022](https://arxiv.org/html/2606.15521#bib.bib24); Itzhak and Levy, [2022](https://arxiv.org/html/2606.15521#bib.bib22); Edman et al., [2024](https://arxiv.org/html/2606.15521#bib.bib12)). Other studies show evidence of better-than-expected handling of non-canonical segmentations (Zheng et al., [2025](https://arxiv.org/html/2606.15521#bib.bib58)), alongside systematic failures in character-level and arithmetic tasks (Singh and Strouse, [2024](https://arxiv.org/html/2606.15521#bib.bib48); Cosma et al., [2025](https://arxiv.org/html/2606.15521#bib.bib10)). Recent mechanistic work studies character-level tokenization and argues that robustness arises from early-layer in-group attention that reconstructs canonical lexical units from fragmented inputs (Yang et al., [2026](https://arxiv.org/html/2606.15521#bib.bib56)). Most closely related, Geh et al. ([2025](https://arxiv.org/html/2606.15521#bib.bib16)) show that non-canonical segmentations can preserve semantics strongly enough to bypass safety filters. Our use of retokenization is narrower and constructive: we treat it as a semantics-preserving probe of both robustness and inference-time diversity.
Across input perturbations such as paraphrases, formatting choices (delimiters and whitespace), verbalizers, and typos, LLMs have been shown to be particularly sensitive to small changes (Liu et al., [2023](https://arxiv.org/html/2606.15521#bib.bib29); Qiang et al., [2024](https://arxiv.org/html/2606.15521#bib.bib39); Bsharat and Shen, [2025](https://arxiv.org/html/2606.15521#bib.bib5); Sclar et al., [2023](https://arxiv.org/html/2606.15521#bib.bib45); Salinas and Morstatter, [2024](https://arxiv.org/html/2606.15521#bib.bib42); Errica et al., [2025](https://arxiv.org/html/2606.15521#bib.bib13); Zhuo et al., [2024](https://arxiv.org/html/2606.15521#bib.bib60); Gan et al., [2024](https://arxiv.org/html/2606.15521#bib.bib15); Zhu et al., [2024](https://arxiv.org/html/2606.15521#bib.bib59); Wahle et al., [2024](https://arxiv.org/html/2606.15521#bib.bib52)). To mitigate tokenization-induced input sensitivity, subword regularization methods such as BPE-dropout expose models to multiple segmentations during training (Kudo, [2018](https://arxiv.org/html/2606.15521#bib.bib25); Provilkov et al., [2020](https://arxiv.org/html/2606.15521#bib.bib38)), while byte-level and tokenizer-free models avoid subword segmentation altogether (Xue et al., [2022](https://arxiv.org/html/2606.15521#bib.bib54); Yu et al., [2023](https://arxiv.org/html/2606.15521#bib.bib57); Wang et al., [2024](https://arxiv.org/html/2606.15521#bib.bib53); Pagnoni et al., [2025](https://arxiv.org/html/2606.15521#bib.bib34)).
Parallel work modifies tokenization for efficiency or adaptation (Feher et al., [2025](https://arxiv.org/html/2606.15521#bib.bib14); Geng et al., [2025](https://arxiv.org/html/2606.15521#bib.bib17); Pilana Liyanage and Yvon, [2026](https://arxiv.org/html/2606.15521#bib.bib37); Kwiatkowski et al., [2024](https://arxiv.org/html/2606.15521#bib.bib27)), while multilingual and morphology-aware work has shown that tokenization can have fairness and language-dependent consequences (Rust et al., [2021](https://arxiv.org/html/2606.15521#bib.bib41); Hofmann et al., [2021](https://arxiv.org/html/2606.15521#bib.bib21); Petrov et al., [2023](https://arxiv.org/html/2606.15521#bib.bib35); Limisiewicz et al., [2024](https://arxiv.org/html/2606.15521#bib.bib28)).
Despite this broad evidence that surface form matters, most studies measure sensitivity only behaviorally, through accuracy, attack success rate, or efficiency. Much less is known about whether meaning-preserving perturbations also change the model’s internal representations. A few exceptions use representation alignment as a robustness objective (Agrawal et al., [2025](https://arxiv.org/html/2606.15521#bib.bib1)), but they do not characterize how segmentation itself changes representations. *Retokenization Sampling* fills this gap by perturbing only the segmentation while preserving the byte string exactly. Unlike typos, paraphrases, or formatting changes, this gives a controlled way to study how tokenized models process equivalent inputs, and our analysis is, to our knowledge, the first to characterize segmentation-induced prompt perturbations at the representational level.
## 3 Results
### 3.1 Emergence of Segmentation Symmetry in Representation Space
How do LLMs process non-canonical segmentations? To answer this, we study internal (latent) representations layer-wise and throughout pretraining. More precisely, we explore representations of retokenizations in intermediate layers of OLMo-2-7B during pretraining. We provide the model with input prompts from a selection of HumanEval prompts and calculate the hidden state of the final token in the prompt at each layer of the model. Denote this final-token hidden state at layer $\ell$ induced by prompt $E^{\mu}(s)$ by $\mathbf{h}_{\ell}^{\mu} \in \mathbb{R}^{D}$. We write the components of this vector as $h_{\ell,i}^{\mu}$ for $i = 1, \ldots, D$. We calculate the displacement from the canonical representation,
$$\mathbf{x}_{\ell}^{\mu} = \mathbf{h}_{\ell}^{\mu} - \mathbf{h}_{\ell}^{0}, \tag{1}$$
where the components are denoted $x_{\ell,i}^{\mu}$. In Figure [1](https://arxiv.org/html/2606.15521#S3.F1) we report the norm of the average displacement $\mu_{\ell}(x)$, as well as the variance of displacement $\mathrm{Var}_{\ell}(x)$, for different layers:
$$\mu_{\ell}(x) \equiv \left\| \langle \mathbf{x}_{\ell} \rangle \right\|, \quad \langle x_{\ell,i} \rangle \equiv \frac{1}{S} \sum_{\mu=1}^{S} x_{\ell,i}^{\mu}, \tag{2}$$
$$\mathrm{Var}_{\ell}(x) \equiv \frac{1}{(D-1)S} \sum_{i=1}^{D} \sum_{\mu=1}^{S} \left( x_{\ell,i}^{\mu} - \langle x_{\ell,i} \rangle \right)^{2}. \tag{3}$$
As a baseline prompt-perturbation comparison, we also look at displacement under prompts to which typos are added (for details see [A.2](https://arxiv.org/html/2606.15521#A1.SS2)). In this setting, the canonical prompt is the same as before, $E^{0}(s)$, while the perturbed prompt has typos injected and is canonically tokenized. The latent representations are computed in the same way.
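The displacement statistics in Eqs. (1) to (3) can be sketched in a few lines of plain Python for a single layer. This is an illustrative sketch rather than the authors' code: the function name `displacement_stats` and the toy vectors are assumptions, and $S$ is taken to be the number of perturbed prompts, as the sums over $\mu$ suggest.

```python
import math


def displacement_stats(canonical, perturbed):
    """Compute the norm of the mean displacement and the displacement variance.

    canonical: list of D floats, the canonical hidden state h^0 at one layer.
    perturbed: list of S lists of D floats, the hidden states h^mu.
    """
    S, D = len(perturbed), len(canonical)
    # Eq. (1): displacement of each perturbed state from the canonical one.
    x = [[h[i] - canonical[i] for i in range(D)] for h in perturbed]
    # Eq. (2): per-component mean over perturbations, then its Euclidean norm.
    mean = [sum(x[mu][i] for mu in range(S)) / S for i in range(D)]
    mean_norm = math.sqrt(sum(m * m for m in mean))
    # Eq. (3): pooled variance around the mean, normalized by (D - 1) * S.
    squared = sum(
        (x[mu][i] - mean[i]) ** 2 for mu in range(S) for i in range(D)
    )
    variance = squared / ((D - 1) * S)
    return mean_norm, variance


# Toy example with D = 2 and S = 2 (values are made up for illustration).
mean_norm, variance = displacement_stats([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0]])
print(mean_norm, variance)
```

In the toy example the two displacements are $(1, 0)$ and $(3, 0)$, so the mean displacement is $(2, 0)$ with norm 2, and the variance is $(1 + 1)/((2-1)\cdot 2) = 1$.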
Figure 1: **Statistics of latent representations under retokenization and typo perturbations.** Retokenization of a prompt displaces latent representations relative to the canonically tokenized prompt. We show the mean (Eq. [2](https://arxiv.org/html/2606.15521#S3.E2), Top Left) and variance (Eq. [3](https://arxiv.org/html/2606.15521#S3.E3), Top Right) of this displacement for OLMo-2-7B during pretraining, for a selection of layers. For comparison, we show behavior for typo perturbations, plotting mean (Bottom Left) and variance (Bottom Right). For both prompt perturbations, the displacement mean converges in all layers except the final layer 32, where it grows. Note that for OLMo-2-7B, layer 32 is the final layer before the unembedding layer, and is related to layer 31 only by an RMSNorm operation. To generate retokenizations and typos, we use $p_{\mathrm{retok}} = p_{\mathrm{typo}} = 0.5$.
To summarize, under retokenization, a model’s internal representations pick up a bias and a variance. Both of these summary statistics diminish significantly during training, but do not completely collapse. Therefore, while the internal representations cluster non-canonical segmentations, they still encode differences. We explore this residual sensitivity to segmentation in reasoning tasks below.
### 3.2 Consequences of Residual Segmentation Sensitivity on Reasoning Tasks
The results from the previous section suggest that retokenized prompts are assigned different representations inside the model. But do these differences survive in reasoning traces? In other words, do the initial differences in representations lead to different outputs? To answer this, we use *Retokenization Sampling* across benchmarks in three distinct domains: mathematical reasoning (GSM8K), code generation (HumanEval), and code-mediated mathematical reasoning (GSM8K Python). We find that sampling alternative segmentations of a fixed prompt produces useful but task-dependent diversity. Concretely, *Retokenization Sampling* takes a fixed prompt string, samples $k$ valid non-canonical segmentations of that same string under the model’s existing vocabulary, runs the model once on each segmented prompt, and decodes greedily from each run.
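The procedure just described can be sketched as follows. This is a hypothetical illustration, not the paper's sampler: here each canonical token is split, with probability `p_retok`, into two pieces that are both in the vocabulary, which is only one possible reading of the $p_{\mathrm{retok}}$ parameter. The names `sample_retokenization`, `retokenization_sampling`, and the `greedy_decode` callback are assumptions introduced for the example.

```python
import random


def sample_retokenization(canonical_tokens, vocab, p_retok=0.5, rng=None):
    """Randomly split canonical tokens into two in-vocabulary pieces.

    Assumed scheme: each token is split with probability `p_retok`,
    provided at least one valid two-piece split exists.
    """
    if rng is None:
        rng = random.Random()
    retokenized = []
    for token in canonical_tokens:
        splits = [
            (token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in vocab and token[i:] in vocab
        ]
        if splits and rng.random() < p_retok:
            retokenized.extend(rng.choice(splits))
        else:
            retokenized.append(token)  # keep the canonical token
    return retokenized


def retokenization_sampling(canonical_tokens, vocab, k, greedy_decode, rng=None):
    """Run a (hypothetical) greedy decoder once per sampled segmentation."""
    if rng is None:
        rng = random.Random()
    outputs = []
    for _ in range(k):
        segmentation = sample_retokenization(canonical_tokens, vocab, rng=rng)
        outputs.append(greedy_decode(segmentation))
    return outputs


# Toy usage: the "decoder" just echoes the decoded prompt string.
toy_vocab = {"probable", "prob", "able", " cause", " ca", "use"}
echo = retokenization_sampling(
    ["probable", " cause"], toy_vocab, k=4,
    greedy_decode=lambda seg: "".join(seg), rng=random.Random(0),
)
print(echo)
```

Because every split keeps the concatenated text unchanged, each sampled segmentation decodes to the same prompt string. Note that this sketch can occasionally return the canonical segmentation itself; a caller wanting strictly non-canonical samples would need to reject and resample those.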
We quantify the output behavior with $\mathrm{pass@retok}(k)$, a metric designed to parallel the standard $\mathrm{pass@k}$ evaluation from Chen et al. ([2021](https://arxiv.org/html/2606.15521#bib.bib7)): for a fixed benchmark prompt, we ask how often at least one segmentation out of $k$ gives a correct solution. Because decoding is deterministic, variation in $\mathrm{pass@retok}(k)$ reflects diversity induced by the input representation rather than by output sampling. Figure [2](https://arxiv.org/html/2606.15521#S3.F2)C shows that these retokenization-sampling curves have the same qualitative rising-with-diminishing-returns shape as $\mathrm{pass@k}$ across all three benchmarks, with the largest gap to canonical $\mathrm{pass@k}$ on HumanEval and smaller gaps on GSM8K and GSM8K Python. GSM8K Python falls naturally between GSM8K and HumanEval: the question remains natural language, but the answer space is programmatic. Figure [2](https://arxiv.org/html/2606.15521#S3.F2)D shows that these aggregate gains are not uniform: the distribution of $\Delta P = P(\mathrm{fail} \mid \mathrm{canon}) - P(\mathrm{fail} \mid \mathrm{retok})$ places noticeable mass above zero, indicating that retokenization reduces failure probability on a subset of instances rather than only changing the average curve.
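For reference, the standard unbiased $\mathrm{pass@k}$ estimator of Chen et al. (2021) is $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ is the number of samples drawn and $c$ the number that are correct. The sketch below assumes, without confirmation from the text, that $\mathrm{pass@retok}(k)$ can be estimated the same way, with $n$ counting sampled segmentations instead of sampled outputs; the function name `pass_at_k` is chosen for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    For pass@retok(k), n would count greedy generations from distinct
    sampled segmentations (an assumption made for this sketch).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 retokenized generations, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

The estimate rises with $k$ and saturates at 1 once $k$ exceeds the number of incorrect samples, which matches the rising-with-diminishing-returns shape described above.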
Figure 2: **Retokenization sampling as an input-side search axis.** A) Conventional sampling fixes the canonical prompt encoding and samples from the output distribution (at temperature = 1), whereas retokenization sampling varies the prompt segmentation and decodes greedily. B) A single surface string admits many valid tokenizations under a fixed vocabulary; the canonical tokenizer chooses only one of them. C) Across HumanEval, GSM8K, and GSM8K Python, $\mathrm{pass@retok}$ exhibits the same qualitative behavior as $\mathrm{pass@k}$, showing that different segmentations expose distinct useful trajectories. The shaded area shows standard errors (see Appendix [B](https://arxiv.org/html/2606.15521#A2)). D) Histogram of per-problem failure-probability differences, $\Delta P = P(\mathrm{fail} \mid \mathrm{canon}) - P(\mathrm{fail} \mid \mathrm{retok})$. Mass above zero indicates problems helped by retokenization. All experiments presented here are for the OLMo-2 7B Instruct model.
**Change in segmentation sensitivity and robustness across training.** We extend our experiments to track the same OLMo-2 7B model family across pre-training and post-training by obtaining $\mathrm{pass@k}$ and $\mathrm{pass@retok}(k)$ for the HumanEval benchmark. These models belong to a sequence of training stages which include, in the order in which they are applied: base pre-training, supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (Instruct). Figure [3](https://arxiv.org/html/2606.15521#S3.F3) (left) shows a gradual increase in both $\mathrm{pass@k}$ and $\mathrm{pass@retok}$ during pre-training, and a sharp increase in both metrics with post-training.
The corresponding mean failure probabilities shown in Figure [3](https://arxiv.org/html/2606.15521#S3.F3) (right) show a synchronized two-step decline for both sampling methods, the first occurring in early pre-training (before step $100000$) and a more prominent decline appearing at the post-training SFT stage, consistent with the $\mathrm{pass@k}$ and $\mathrm{pass@retok}(k)$ trends. While the evidence for and mechanism of improvement in $\mathrm{pass@k}$ scores from post-training have been shown before (Chen et al., [2021](https://arxiv.org/html/2606.15521#bib.bib7); Cobbe et al., [2021](https://arxiv.org/html/2606.15521#bib.bib9); Ouyang et al., [2022](https://arxiv.org/html/2606.15521#bib.bib33)), the reason behind a proportionate increase in $\mathrm{pass@retok}$ scores remains unclear. These results indicate a dominant role of post-training in the emergence of segmentation invariance in language models. We also observe the same qualitative pattern across model families: stronger models generally remain stronger under retokenization, with family-specific variation in the gap to canonical sampling (Appendix [D.6](https://arxiv.org/html/2606.15521#A4.SS6)).
Figure 3: **Training improves canonical and retokenized sampling together.** Left: Solid curves show $\mathrm{pass@k}$ and dashed curves show $\mathrm{pass@retok}$ for successive OLMo-2 7B pre-training checkpoints and post-training stages on HumanEval. Each training stage shifts both curves upward, with especially large gains from base to SFT and further gains through DPO. Gains from DPO to Instruct appear marginal. Right: The corresponding mean per-problem failure probability $\langle P(\mathrm{fail}) \rangle$ decreases across training for both sampling procedures, but canonical sampling remains consistently lower than retokenized sampling. Segmentation robustness therefore tracks overall task competence rather than emerging as a separate skill.
### 3.3 Diversity in generations from retokenization and temperature sampling
The previous section suggests that retokenization can be viewed as a source of noise, allowing for a potentially new sampling axis. However, is there any difference between retokenization and temperature sampling in the space of sampled outputs? To probe this question, we measure syntactic diversity directly in the space of programs generated for the HumanEval dataset by the OLMo-2 7B Instruct model.
We adapt a previously introduced syntactic diversity metric (Shypula et al., [2025](https://arxiv.org/html/2606.15521#bib.bib47)) by parsing each program generated by the model into a Python abstract syntax tree (AST) and extracting a set of bottom-up subtrees of fixed height ($h = 4$). We then compare two programs by pooling their extracted subtrees and measuring how many of those pooled subtrees are not shared. If $S_{1}$ and $S_{2}$ denote the multisets of subtrees from two programs, we define the distance between the subtree multisets as
$$d_{\mathrm{AST}}(S_{1}, S_{2}) = 1 - \frac{2\left|S_{1} \cap S_{2}\right|}{|S_{1}| + |S_{2}|} \tag{4}$$
where the numerator counts, twice, the subtree instances shared by the two generations, and the denominator counts the total number of subtree instances contributed by the two generations. This is the Sørensen–Dice form of an overlap distance, a monotone transformation of the Jaccard distance that likewise takes values in the unit interval. The highest possible distance is unity, which is obtained when $S_{1}$ and $S_{2}$ have no elements in common. The distance is zero when the multisets are identical. Identifiers, argument names, and literal values are canonicalized before subtree extraction, so the metric emphasizes structural rather than lexical differences.
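A minimal sketch of this distance using Python's standard `ast` module is shown below. Several details are assumptions made for illustration and may differ from the metric actually used: subtrees are serialized by node-type names only (which discards identifiers and literal values), a leaf is taken to have height 1, and only subtrees of exactly the requested height are collected. The function names `subtree_signatures` and `d_ast` are hypothetical.

```python
import ast
from collections import Counter


def subtree_signatures(source, height=4):
    """Multiset of structural signatures for AST subtrees of a given height."""
    signatures = Counter()

    def visit(node):
        # Returns (height, signature) for the subtree rooted at `node`.
        children = [visit(child) for child in ast.iter_child_nodes(node)]
        node_height = 1 + max((h for h, _ in children), default=0)
        # Node-type names only: identifiers and literals are ignored.
        signature = (
            type(node).__name__ + "(" + ",".join(s for _, s in children) + ")"
        )
        if node_height == height:
            signatures[signature] += 1
        return node_height, signature

    visit(ast.parse(source))
    return signatures


def d_ast(s1, s2):
    """Eq. (4): one minus twice the multiset overlap over the total size."""
    total = sum(s1.values()) + sum(s2.values())
    if total == 0:
        return 0.0  # convention for two empty multisets (an assumption)
    shared = sum((s1 & s2).values())  # Counter & is multiset intersection
    return 1.0 - 2.0 * shared / total


prog_a = "def f(x):\n    return x + 1\n"
prog_b = "def g(y):\n    return y + 2\n"  # same structure, different names
prog_c = "def f(x):\n    for i in range(x):\n        x += i\n    return x\n"
print(d_ast(subtree_signatures(prog_a), subtree_signatures(prog_b)))
print(d_ast(subtree_signatures(prog_a), subtree_signatures(prog_c)))
```

Programs that differ only in names and constants map to the same signatures and so have distance zero, while structurally different programs share fewer subtrees and move toward a distance of one.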
Figure 4: **Retokenization recovers structurally diverse correct programs (A, B).** Each point is a HumanEval task, comparing mean AST-based syntactic diversity under retokenization sampling (x-axis) and canonical temperature sampling (y-axis). Higher values reflect higher diversity (error bars reflect standard error). A) All completions. B) Only correct completions. A strong correlation in diversity within correct completions shows that retokenization does not merely reproduce one correct solution; it explores a comparable range of program structures. Red circles highlight tasks solved only by *Retokenization Sampling*. **Within-family and cross-family diversity are of similar scale (C, D).** Tasks are sorted by within-canonical diversity $d(C)$. Blue shows within-retokenization diversity $d(R)$, red shows cross-family diversity $d(C,R)$, and green shows within-canonical diversity $d(C)$. C) All completions. D) Only correct completions. Retokenized completions typically remain near the canonical diversity scale rather than collapsing to a single template or diverging into a disjoint region of program space.
Figure [4](https://arxiv.org/html/2606.15521#S3.F4)A,B shows mean within-family syntactic diversity under temperature sampling and retokenization. The plotted mean for each task is the average of the pairwise $d_{\mathrm{AST}}$ over all within-family completion pairs, and the error bars are standard errors over those pairwise values. When all completions are included, most tasks lie below the diagonal, indicating that per-task diversity is higher under retokenization sampling. Restricting to only correct completions per task shows a strong correlation, as these generations cover much of the same diversity range as correct temperature-sampled programs.
The main takeaway is therefore not that retokenization matches temperature sampling exactly, but that it does not collapse onto a single canonical template, despite being greedily decoded\. Instead, correct generations by retokenization often explore a substantial range of structurally distinct correct programs\.
Comparing diversity in temperature sampling and retokenization sampling\. While both temperature sampling and Retokenization Sampling generate non\-trivial amounts of syntactic diversity, it is not yet clear whether they explore the same solution families\. A stronger comparison looks at the geometry of the two completion sets directly: for a fixed task, are retokenized programs mostly variations of the canonical temperature\-sampled ones, or do they occupy neighboring but distinct regions of program space? We therefore add a *cross\-family* quantity: for each task, we compute $d_{\text{AST}}$ for every pair consisting of one temperature\-sampled completion and one retokenization\-sampled completion, and average over those cross\-family pairs\. Let $C$ denote the canonical temperature\-sampled completions for a task and $R$ the retokenization\-sampled completions for the same task, and write $d_{ij}$ for the pairwise $d_{\text{AST}}$ between completions $i$ and $j$\. We then compare
$$d(C,R) \equiv \langle d_{ij} \rangle_{i \in C,\, j \in R} \equiv \frac{1}{|C||R|} \sum_{i \in C,\, j \in R} d_{ij}, \tag{5}$$
$$d(C) \equiv \langle d_{ij} \rangle_{i<j \in C}, \qquad d(R) \equiv \langle d_{ij} \rangle_{i<j \in R}. \tag{6}$$
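The averages in Eqs\. \(5\) and \(6\) can be sketched in a few lines of Python\. The function names `within_family` and `cross_family` are hypothetical, and `dist` stands for any symmetric pairwise distance such as $d_{\text{AST}}$; the toy example uses an absolute difference on integers purely to make the arithmetic easy to follow\.

```python
from itertools import combinations, product
from statistics import mean


def within_family(items, dist):
    """Mean distance over unordered pairs i < j within one family (Eq. 6)."""
    return mean(dist(a, b) for a, b in combinations(items, 2))


def cross_family(family_c, family_r, dist):
    """Mean distance over all |C||R| cross-family pairs (Eq. 5)."""
    return mean(dist(c, r) for c, r in product(family_c, family_r))


# Toy stand-ins: integers as "completions", absolute difference as distance.
toy_dist = lambda a, b: abs(a - b)
C = [0, 1]
R = [2, 4]
d_C = within_family(C, toy_dist)     # one pair: |0 - 1|
d_R = within_family(R, toy_dist)     # one pair: |2 - 4|
d_CR = cross_family(C, R, toy_dist)  # four pairs: 2, 4, 1, 3
```

Each family needs at least two completions for the within\-family mean to be defined, which is one reason the correct\-only curves are noisier when few correct samples exist\.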
In Figure [4](https://arxiv.org/html/2606.15521#S3.F4)C,D, sorting tasks by $d(C)$ reveals that retokenized completions track the same coarse diversity ordering as canonical completions\. For all completions, $d(R)$ and $d(C,R)$ usually remain close to $d(C)$, which means the retokenized solutions neither collapse onto a single canonical program family nor jump to a completely disjoint region of program space\. On the subset of correct completions the curves become noisier, as expected from fewer samples, but the same qualitative conclusion survives\. Cross\-family diversity is of the same order as within\-family diversity, so the two completion sets do not separate into obviously disjoint regions of the measured program space\.
### 3\.4 Retokenization Sampling comparison with typo sampling
Typographical perturbations provide a useful control for our method because they also perturb the input to the model, but unlike retokenization they do not preserve the exact byte string\. A typo can perturb the input representation by changing token boundaries, changing lexical cues, or genuinely changing meaning\. Retokenization isolates only the first mechanism\. Typo sampling is thus a stronger but less controlled perturbation method: if it helps, it is harder to interpret why\. We therefore treat it as a baseline for input\-side diversity here rather than as a direct substitute for retokenization\.
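The distinction can be made concrete with a toy Python sketch\. Everything here is hypothetical: the vocabulary, the uniform random segmentation policy, and the adjacent\-swap typo are stand\-ins, not the procedures used in the experiments \(the typo implementation is described in the appendix\), and characters stand in for bytes\. The point is only that a resegmentation concatenates back to the identical string, whereas a typo edits the string itself\.

```python
import random


def random_segmentation(text, vocab, rng):
    """Sample one valid segmentation of `text` under a toy vocabulary.

    Single characters are always allowed, so a segmentation always exists.
    """
    tokens, i = [], 0
    while i < len(text):
        # Candidate tokens that match the text at position i.
        matches = {t for t in vocab if t and text.startswith(t, i)}
        options = sorted({text[i]} | matches)
        tok = rng.choice(options)
        tokens.append(tok)
        i += len(tok)
    return tokens


def swap_typo(text, rng):
    """Swap two adjacent characters; this changes the surface string."""
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


rng = random.Random(0)
prompt = "return sorted(xs)"
vocab = {"return", "ret", "urn", " sorted", "sort", "ed", "(xs", ")"}

segmentation = random_segmentation(prompt, vocab, rng)
assert "".join(segmentation) == prompt  # surface string preserved exactly

perturbed = swap_typo("return", rng)
assert perturbed != "return"            # surface string changed
```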
Figure 5: Retokenization versus typo sampling on HumanEval\. Left: pass curves for temperature sampling \(green\), typo sampling \(blue\), and retokenization sampling \(red\) on HumanEval\. Typo sampling consistently outperforms retokenization sampling but remains below conventional temperature sampling across the full range of $k$\. Right: distributions over tasks of the empirical failure probability $P(\mathrm{failure})$ for the same three methods, with dashed vertical lines marking the corresponding means\. Typo perturbations shift mass away from certain failure relative to retokenization, but still leave substantially more hard tasks than canonical output sampling\. This ordering suggests that visible surface perturbations provide a stronger source of input\-side diversity than segmentation alone, though they are also less controlled because they change the prompt text itself\.
Figure [5](https://arxiv.org/html/2606.15521#S3.F5) makes this control comparison on the HumanEval dataset using the OLMo\-2 7B Instruct model\. In the left panel, typo sampling \(see Appendix [A\.2](https://arxiv.org/html/2606.15521#A1.SS2) for implementation details\) consistently sits between canonical temperature sampling and retokenization sampling across the full range of $k$\. The right panel shows the same ordering in a complementary way through the per\-task failure\-probability distributions: canonical sampling concentrates more mass on easy tasks with low $P_{\mathrm{fail}}$, retokenization leaves the largest concentration near almost\-certain failure, and typo sampling shifts some of that mass toward lower failure probabilities without matching canonical performance\. One interpretation is that visible perturbations buy additional exploration power precisely because they move more than segmentation alone: they can alter lexical cues and local semantics, not just token boundaries\.
If typo sampling were helped only by re\-segmentation effects, its failure distribution should be much closer to that of retokenization\. Instead, the persistent gap suggests that a substantial part of typo\-induced diversity comes from changing the prompt text itself\. This strengthens the case for retokenization as the cleaner scientific probe\. Retokenization is weaker as a search procedure, but when it changes outcomes we can attribute those changes specifically to segmentation rather than to new wording or corrupted semantics\.
### 3\.5 Compute comparison across different sampling techniques
Raw $\mathrm{pass@k}$ and $\mathrm{pass@retok}$ curves compare methods at equal numbers of samples, not at equal prompt budgets\. This distinction matters because retokenization typically lengthens the prompt, while temperature sampling leaves the prompt fixed and expends compute only through additional decoded outputs\. Figure [6](https://arxiv.org/html/2606.15521#S3.F6) therefore compares temperature sampling, retokenization, and typo sampling under an explicit prompt\-token budget on HumanEval\. Here, $\bar{L}(k)$ denotes the expected prompt\-token cost for each sampling process at $k$ samples\. The left panel shows the cost side of this comparison, where temperature sampling is the cheapest in prompt tokens, typo sampling is intermediate, and retokenization is the most expensive, likely because equivalent segmentations usually expand the input length\. The right panel shows the resulting accuracy frontier under matched prompt budget\. Temperature sampling dominates, typo sampling is second, and retokenization is least compute\-efficient of the three\. We therefore do not claim that retokenization is a drop\-in replacement for conventional sampling under strict budget matching\. Its value is complementary: it offers a semantics\-preserving source of diversity that can recover solutions output\-side sampling misses, even if the present compute frontier still favors canonical temperature sampling\.
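For reference, equal\-sample curves of this kind are commonly computed with the unbiased $\mathrm{pass@k}$ estimator introduced with HumanEval by Chen et al\. \[2021\], $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct\. The Python sketch below implements that standard estimator\. Whether the curves here use exactly this estimator, and how $\mathrm{pass@retok}$ relates to it, are assumptions; the function name `pass_at_k` is hypothetical\.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Equals one minus the probability that a uniformly random size-k
    subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 5 are correct, `pass_at_k(10, 5, 1)` is one minus 5/10\. A matched\-budget comparison would then plot such values against the expected prompt\-token cost rather than against $k$\.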
Figure 6: Matched\-budget comparison on HumanEval\. Left: Expected prompt\-token cost $\bar{L}(k)$ as a function of sample count $k$ for temperature sampling, retokenization sampling, and typo sampling\. Retokenization is most expensive because alternative segmentations typically lengthen the prompt\. Right: pass rate versus expected prompt tokens\. Temperature sampling gives the best compute frontier, typo sampling is intermediate, and retokenization is least efficient under this budget, indicating that its main value is complementary diversity rather than raw cost\-effectiveness\.
## 4 Conclusion
Equivalent segmentations of the same byte string define a latent degree of freedom that tokenized language models do not treat as irrelevant\. Our results show that modern LLMs learn substantial, but incomplete, segmentation invariance: retokenized prompts usually preserve task understanding, yet still perturb the internal computation enough to generate new successful trajectories\. This makes retokenization scientifically useful as a probe of compositional understanding and practically interesting as a semantics\-preserving source of inference\-time diversity\.
The picture that emerges is not one of exact symmetry\. Behaviorally, this residual sensitivity is most useful on open\-ended domains such as code generation, where multiple correct trajectories exist and equivalent segmentations can steer the model toward different algorithmic families\. At the same time, matched\-budget comparisons make clear that retokenization is not currently a more compute\-efficient substitute for conventional temperature sampling\. Whether these findings apply to models trained on other languages remains to be explored, and it is unknown whether specific retokenizations help or hurt on specific tasks\. Moreover, while we use one particular stochastic retokenization procedure under a fixed vocabulary, different segmentation policies may yield different tradeoffs between robustness and diversity\.
The broader implication is that tokenization should be understood not just as preprocessing, but as part of the model’s computational interface to text\. As long as deployed LLMs remain tokenized, the space of equivalent segmentations offers a controlled way to study how invariance and sensitivity coexist in sequence models, and how that tradeoff shapes both capability and robustness\.
## Acknowledgments and Disclosure of Funding
We acknowledge helpful conversations with Roberto Avalos, David Berman, Andrey Gromov, Matt Kleban, Noam Levi, Ilya Nemenman, Sean Ridout and Marat Freytsis\. We acknowledge the support and computing resources provided by the HyPER C3 computing cluster at Emory University\.
TC acknowledges the support of Emory University\.
## References
- Agrawal et al\. \[2025\] Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Thomas Mensink, and Marek Rei\. Enhancing LLM robustness to perturbed instructions: An empirical study\. In *ICLR 2025 Workshop on Building Trust in Language Models and Applications*, 2025\. URL [https://arxiv\.org/abs/2504\.02733](https://arxiv.org/abs/2504.02733)\.
- Ali et al\. \[2024\] Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, Charvi Jain, Alexander Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, Stefan Kesselheim, and Nicolas Flores\-Herr\. Tokenizer choice for LLM training: Negligible or crucial? In Kevin Duh, Helena Gomez, and Steven Bethard, editors, *Findings of the Association for Computational Linguistics: NAACL 2024*, pages 3907–3924, Mexico City, Mexico, 2024\. Association for Computational Linguistics\. URL [https://aclanthology\.org/2024\.findings\-naacl\.247/](https://aclanthology.org/2024.findings-naacl.247/)\.
- Biderman et al\. \[2023\] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al\. Pythia: A suite for analyzing large language models across training and scaling\. In *International conference on machine learning*, pages 2397–2430\. PMLR, 2023\.
- Bostrom and Durrett \[2020\] Kaj Bostrom and Greg Durrett\. Byte pair encoding is suboptimal for language model pretraining\. In Trevor Cohn, Yulan He, and Yang Liu, editors, *Findings of the Association for Computational Linguistics: EMNLP 2020*, pages 4617–4624, Online, 2020\. Association for Computational Linguistics\. URL [https://aclanthology\.org/2020\.findings\-emnlp\.414/](https://aclanthology.org/2020.findings-emnlp.414/)\.
- Bsharat and Shen \[2025\] Sondos Mahmoud Bsharat and Zhiqiang Shen\. Prompting test\-time scaling is a strong llm reasoning data augmentation\. *arXiv preprint arXiv:2510\.09599*, 2025\. URL [https://arxiv\.org/abs/2510\.09599](https://arxiv.org/abs/2510.09599)\.
- Chai et al\. \[2024\] Yekun Chai, Yewei Fang, Qiwei Peng, and Xuhong Li\. Tokenization falling short: On subword robustness in large language models\. In Yaser Al\-Onaizan, Mohit Bansal, and Yun\-Nung Chen, editors, *Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 1582–1599, Miami, Florida, USA, 2024\. Association for Computational Linguistics\. URL [https://aclanthology\.org/2024\.findings\-emnlp\.86/](https://aclanthology.org/2024.findings-emnlp.86/)\.
- Chen et al\. \[2021\] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al\. Evaluating large language models trained on code\. *arXiv:2107\.03374*, 2021\. URL [https://arxiv\.org/abs/2107\.03374](https://arxiv.org/abs/2107.03374)\.
- Chowdhery et al\. \[2023\] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al\. Palm: Scaling language modeling with pathways\. *Journal of Machine Learning Research*, 24\(240\):1–113, 2023\. URL [http://jmlr\.org/papers/v24/22\-1144\.html](http://jmlr.org/papers/v24/22-1144.html)\.
- Cobbe et al\. \[2021\] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\. Training verifiers to solve math word problems\. *arXiv preprint arXiv:2110\.14168*, 2021\. URL [https://arxiv\.org/abs/2110\.14168](https://arxiv.org/abs/2110.14168)\.
- Cosma et al\. \[2025\] Adrian Cosma, Stefan Ruseti, Emilian Radoi, and Mihai Dascalu\. The strawberry problem: Emergence of character\-level understanding in tokenized language models\. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 28252–28263, Suzhou, China, 2025\. Association for Computational Linguistics\. URL [https://aclanthology\.org/2025\.emnlp\-main\.1434/](https://aclanthology.org/2025.emnlp-main.1434/)\.
- Dang et al\. \[2025\] Yizhou Dang, Yuting Liu, Enneng Yang, Minhan Huang, Guibing Guo, Jianzhe Zhao, and Xingwei Wang\. Data augmentation as free lunch: Exploring the test\-time augmentation for sequential recommendation\. In *Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval*, SIGIR ’25, page 1466–1475, New York, NY, USA, 2025\. Association for Computing Machinery\. ISBN 9798400715921\. URL [https://arxiv\.org/abs/2504\.04843](https://arxiv.org/abs/2504.04843)\.
- Edman et al\. \[2024\] Lukas Edman, Helmut Schmid, and Alexander Fraser\. CUTE: Measuring LLMs’ understanding of their tokens\. In Yaser Al\-Onaizan, Mohit Bansal, and Yun\-Nung Chen, editors, *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 3017–3026, Miami, Florida, USA, 2024\. Association for Computational Linguistics\. URL [https://aclanthology\.org/2024\.emnlp\-main\.177/](https://aclanthology.org/2024.emnlp-main.177/)\.
- Errica et al\. \[2025\] Federico Errica, Giuseppe Siracusano, Davide Sanvito, and Roberto Bifulco\. What did i do wrong? quantifying llms’ sensitivity and consistency to prompt engineering\. In *Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\)*, 2025\. URL [https://arxiv\.org/abs/2406\.12334](https://arxiv.org/abs/2406.12334)\.
- Feher et al\. \[2025\] Darius Feher, Ivan Vulić, and Benjamin Minixhofer\. Retrofitting large language models with dynamic tokenization\. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 29866–29883, Vienna, Austria, 2025\. Association for Computational Linguistics\. ISBN 979\-8\-89176\-251\-0\. URL [https://aclanthology\.org/2025\.acl\-long\.1444/](https://aclanthology.org/2025.acl-long.1444/)\.
- Gan et al\. \[2024\] Esther Gan, Yiran Zhao, Liying Cheng, Mao Yancan, Anirudh Goyal, Kenji Kawaguchi, Min\-Yen Kan, and Michael Shieh\. Reasoning robustness of LLMs to adversarial typographical errors\. In Yaser Al\-Onaizan, Mohit Bansal, and Yun\-Nung Chen, editors, *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 10449–10459, Miami, Florida, USA, 2024\. Association for Computational Linguistics\. URL [https://aclanthology\.org/2024\.emnlp\-main\.584/](https://aclanthology.org/2024.emnlp-main.584/)\.
- Geh et al\. \[2025\] Renato Geh, Zilei Shao, and Guy Van Den Broeck\. Adversarial tokenization\. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 20738–20765, Vienna, Austria, 2025\. Association for Computational Linguistics\. ISBN 979\-8\-89176\-251\-0\. URL [https://aclanthology\.org/2025\.acl\-long\.1012/](https://aclanthology.org/2025.acl-long.1012/)\.
- Geng et al\. \[2025\] Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, and Robert West\. zip2zip: Inference\-time adaptive vocabularies for language models via token compression\. In *Tokenization Workshop*, 2025\. URL [https://arxiv\.org/abs/2506\.01084](https://arxiv.org/abs/2506.01084)\.
- Grattafiori et al\. \[2024\] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, et al\. The Llama 3 Herd of Models\. *arXiv preprint arXiv:2407\.21783*, 2024\. URL [https://arxiv\.org/abs/2407\.21783v3](https://arxiv.org/abs/2407.21783v3)\.
- Haslett \[2025\] David A\. Haslett\. Tokenization changes meaning in large language models: Evidence from Chinese\. *Computational Linguistics*, 51\(3\):785–814, 2025\. URL [https://aclanthology\.org/2025\.cl\-3\.3/](https://aclanthology.org/2025.cl-3.3/)\.
- Hendrycks et al\. \[2021\] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt\. Measuring massive multitask language understanding\. *Proceedings of the International Conference on Learning Representations \(ICLR\)*, 2021\.
- Hofmann et al\. \[2021\] Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze\. Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words\. In *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 3594–3608, 2021\. URL [https://aclanthology\.org/2021\.acl\-long\.279/](https://aclanthology.org/2021.acl-long.279/)\.
- Itzhak and Levy \[2022\] Itay Itzhak and Omer Levy\. Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens\. In *Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 5061–5068, 2022\. URL [https://aclanthology\.org/2022\.naacl\-main\.373/](https://aclanthology.org/2022.naacl-main.373/)\.
- Kaplan et al\. \[2025\] Guy Kaplan, Matanel Oren, Yuval Reif, and Roy Schwartz\. From tokens to words: On the inner lexicon of llms\. In *International Conference on Learning Representations*, volume 2025, pages 70068–70092, 2025\. URL [https://arxiv\.org/abs/2410\.05864](https://arxiv.org/abs/2410.05864)\.
- Kaushal and Mahowald \[2022\] Ayush Kaushal and Kyle Mahowald\. What do tokens know about their characters and how do they know it? In *Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 2487–2507, 2022\. URL [https://aclanthology\.org/2022\.naacl\-main\.179/](https://aclanthology.org/2022.naacl-main.179/)\.
- Kudo \[2018\] Taku Kudo\. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates\. In *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 66–75, 2018\. URL [https://aclanthology\.org/P18\-1007/](https://aclanthology.org/P18-1007/)\.
- Kudo and Richardson \[2018\] Taku Kudo and John Richardson\. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing\. In Eduardo Blanco and Wei Lu, editors, *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 66–71, Brussels, Belgium, 2018\. Association for Computational Linguistics\. URL [https://aclanthology\.org/D18\-2012/](https://aclanthology.org/D18-2012/)\.
- Kwiatkowski et al\. \[2024\] Robert Kwiatkowski, Zijian Wang, Robert Giaquinto, Varun Kumar, Xiaofei Ma, Anoop Deoras, Bing Xiang, and Ben Athiwaratkun\. FusionToken: Enhancing Compression and Efficiency in Language Model Tokenization\. In *The Twelfth International Conference on Learning Representations*\. ICML, 2024\. URL [https://icml\.cc/virtual/2023/28406](https://icml.cc/virtual/2023/28406)\.
- Limisiewicz et al\. \[2024\] Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, and Luke Zettlemoyer\. MYTE: Morphology\-driven byte encoding for better and fairer multilingual language modeling\. In Lun\-Wei Ku, Andre Martins, and Vivek Srikumar, editors, *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 15059–15076, Bangkok, Thailand, 2024\. Association for Computational Linguistics\. URL [https://aclanthology\.org/2024\.acl\-long\.804/](https://aclanthology.org/2024.acl-long.804/)\.
- Liu et al\. \[2023\] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig\. Pre\-train, prompt, and predict: A systematic survey of prompting methods in natural language processing\. *ACM Comput\. Surv\.*, 55\(9\), 2023\. ISSN 0360\-0300\. URL [https://arxiv\.org/abs/2107\.13586](https://arxiv.org/abs/2107.13586)\.
- Merity et al\. \[2017\] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher\. Pointer sentinel mixture models\. In *International Conference on Learning Representations \(ICLR\)*, 2017\. URL [https://arxiv\.org/abs/1609\.07843](https://arxiv.org/abs/1609.07843)\.
- Merrill and Sabharwal \[2024\] William Merrill and Ashish Sabharwal\. The expressive power of transformers with chain of thought\. In *International Conference on Learning Representations*, volume 2024, pages 7690–7706, 2024\. URL [https://arxiv\.org/abs/2310\.07923](https://arxiv.org/abs/2310.07923)\.
- OLMo et al\. \[2025\] Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al\. 2 OLMo 2 Furious\. *arXiv:2501\.00656*, 2025\. URL [http://arxiv\.org/abs/2501\.00656](http://arxiv.org/abs/2501.00656)\.
- Ouyang et al\. \[2022\] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe\. Training language models to follow instructions with human feedback\. In *Proceedings of the 36th International Conference on Neural Information Processing Systems*, NIPS ’22, Red Hook, NY, USA, 2022\. Curran Associates Inc\. ISBN 9781713871088\.
- Pagnoni et al\. \[2025\] Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srini Iyer\. Byte latent transformer: Patches scale better than tokens\. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 9238–9258, Vienna, Austria, 2025\. Association for Computational Linguistics\. ISBN 979\-8\-89176\-251\-0\. URL [https://aclanthology\.org/2025\.acl\-long\.453/](https://aclanthology.org/2025.acl-long.453/)\.
- Petrov et al\. \[2023\] Aleksandar Petrov, Emanuele La Malfa, Philip H\.S\. Torr, and Adel Bibi\. Language model tokenizers introduce unfairness between languages\. In *Proceedings of the 37th International Conference on Neural Information Processing Systems*, NIPS ’23, Red Hook, NY, USA, 2023\. Curran Associates Inc\.
- Pfau et al\. \[2024\] Jacob Pfau, William Merrill, and Samuel R\. Bowman\. Let’s think dot by dot: Hidden computation in transformer language models\. In *Proceedings of the Conference on Language Modeling \(COLM\)*, 2024\. URL [https://arxiv\.org/abs/2404\.15758](https://arxiv.org/abs/2404.15758)\.
- Pilana Liyanage and Yvon \[2026\] Vijini Pilana Liyanage and François Yvon\. AdaptBPE: From General Purpose to Specialized Tokenizers\. In *Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 2607–2620, 2026\. URL [https://aclanthology\.org/2026\.eacl\-long\.119/](https://aclanthology.org/2026.eacl-long.119/)\.
- Provilkov et al\. \[2020\] Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita\. BPE\-dropout: Simple and effective subword regularization\. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 1882–1892, Online, 2020\. Association for Computational Linguistics\. URL [https://aclanthology\.org/2020\.acl\-main\.170/](https://aclanthology.org/2020.acl-main.170/)\.
- Qiang et al\. \[2024\] Yao Qiang, Subhrangshu Nandi, Ninareh Mehrabi, Greg Ver Steeg, Anoop Kumar, Anna Rumshisky, and Aram Galstyan\. Prompt Perturbation Consistency Learning for Robust Language Models\. In *Findings of the Association for Computational Linguistics: EACL 2024*, pages 1357–1370, 2024\. URL [https://aclanthology\.org/2024\.findings\-eacl\.91/](https://aclanthology.org/2024.findings-eacl.91/)\.
- Radford et al\. \[2019\] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever\. Language models are unsupervised multitask learners\. 2019\. URL [https://cdn\.openai\.com/better\-language\-models/language\_models\_are\_unsupervised\_multitask\_learners\.pdf](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)\.
- Rust et al\. \[2021\]Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych\.How good is your tokenizer? on the monolingual performance of multilingual language models\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021, Online, August 1\-6, 2021*, pages 3118–3135, 2021\.URL[https://arxiv\.org/abs/2012\.15613](https://arxiv.org/abs/2012.15613)\.
- Salinas and Morstatter \[2024\]Abel Salinas and Fred Morstatter\.The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance\.In Lun\-Wei Ku, Andre Martins, and Vivek Srikumar, editors,*Findings of the Association for Computational Linguistics: ACL 2024*, pages 4629–4651, Bangkok, Thailand, 2024\. Association for Computational Linguistics\.URL[https://aclanthology\.org/2024\.findings\-acl\.275/](https://aclanthology.org/2024.findings-acl.275/)\.
- Schmidt et al\. \[2024\]Craig W Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, and Chris Tanner\.Tokenization Is More Than Compression\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 678–702, 2024\.URL[https://aclanthology\.org/2024\.emnlp\-main\.40/](https://aclanthology.org/2024.emnlp-main.40/)\.
- Schuster and Nakajima \[2012\]Mike Schuster and Kaisuke Nakajima\.Japanese and Korean voice search\.In*2012 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\)*, pages 5149–5152, 2012\.URL[https://ieeexplore\.ieee\.org/document/6289079](https://ieeexplore.ieee.org/document/6289079)\.
- Sclar et al\. \[2023\]Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr\.Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting\.*arXiv preprint arXiv:2310\.11324*, 2023\.URL[https://arxiv\.org/abs/2310\.11324](https://arxiv.org/abs/2310.11324)\.
- Sennrich et al\. \[2016\]Rico Sennrich, Barry Haddow, and Alexandra Birch\.Neural Machine Translation of Rare Words with Subword Units\.In*Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 1715–1725, 2016\.URL[https://aclanthology\.org/P16\-1162/](https://aclanthology.org/P16-1162/)\.
- Shypula et al\. \[2025\]Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, and Osbert Bastani\.Evaluating the diversity and quality of LLM generated content\.In*Second Conference on Language Modeling*, 2025\.URL[https://arxiv\.org/abs/2504\.12522](https://arxiv.org/abs/2504.12522)\.
- Singh and Strouse \[2024\]Aaditya K\. Singh and D\. J\. Strouse\.Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs\.*arXiv preprint arXiv:2402\.14903*, 2024\.URL[http://arxiv\.org/abs/2402\.14903](http://arxiv.org/abs/2402.14903)\.
- Sivan and Tsodyks \[2025\]Doron Sivan and Misha Tsodyks\.Information rate of meaningful communication\.*Proceedings of the National Academy of Sciences*, 122\(25\):e2502353122, 2025\.URL[https://pnas\.org/doi/10\.1073/pnas\.2502353122](https://pnas.org/doi/10.1073/pnas.2502353122)\.
- Snell et al\. \[2024\]Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar\.Scaling llm test\-time compute optimally can be more effective than scaling model parameters, 2024\.URL[https://arxiv\.org/abs/2408\.03314](https://arxiv.org/abs/2408.03314)\.
- Team et al\. \[2025\]Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, and Morgane Rivière et al\.Gemma 3 Technical Report\.*arXiv:2503\.19786v1*, 2025\.URL[https://arxiv\.org/abs/2503\.19786v1](https://arxiv.org/abs/2503.19786v1)\.
- Wahle et al\. \[2024\]Jan Philip Wahle, Terry Ruas, Yang Xu, and Bela Gipp\.Paraphrase types elicit prompt engineering capabilities\.In Yaser Al\-Onaizan, Mohit Bansal, and Yun\-Nung Chen, editors,*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 11004–11033, Miami, Florida, USA, 2024\. Association for Computational Linguistics\.URL[https://aclanthology\.org/2024\.emnlp\-main\.617/](https://aclanthology.org/2024.emnlp-main.617/)\.
- Wang et al\. \[2024\]Junxiong Wang, Tushaar Gangavarapu, Jing Nathan Yan, and Alexander M Rush\.Mambabyte: Token\-free selective state space model\.In*Proceedings of Conference on Language Modeling \(COLM\)*, 2024\.URL[https://arxiv\.org/abs/2401\.13660](https://arxiv.org/abs/2401.13660)\.
- Xue et al\. \[2022\]Linting Xue, Aditya Barua, Noah Constant, Rami Al\-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel\.ByT5: Towards a token\-free future with pre\-trained byte\-to\-byte models\.*Transactions of the Association for Computational Linguistics*, 10:291–306, 2022\.URL[https://aclanthology\.org/2022\.tacl\-1\.17/](https://aclanthology.org/2022.tacl-1.17/)\.
- Yang et al\. \[2025\]An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, and Chenxu Lv et al\.Qwen3 Technical Report\.*arXiv:2505\.09388*, 2025\.URL[https://arxiv\.org/abs/2505\.09388v1](https://arxiv.org/abs/2505.09388v1)\.
- Yang et al\. \[2026\]Zhipeng Yang, Shu Yang, Lijie Hu, and Di Wang\.Word Recovery in Large Language Models Enables Character\-Level Tokenization Robustness\.*arXiv:2603\.10771*, 2026\.URL[https://arxiv\.org/abs/2603\.10771v1](https://arxiv.org/abs/2603.10771v1)\.
- Yu et al\. \[2023\]Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis\.Megabyte: Predicting million\-byte sequences with multiscale transformers\.In*Advances in Neural Information Processing Systems 36 \(NeurIPS 2023\)*, 2023\.URL[https://arxiv\.org/abs/2305\.07185](https://arxiv.org/abs/2305.07185)\.
- Zheng et al\. \[2025\]Brian Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, and Noah Smith\.Broken tokens? your language model can secretly handle non\-canonical tokenizations\.In D\. Belgrave, C\. Zhang, H\. Lin, R\. Pascanu, P\. Koniusz, M\. Ghassemi, and N\. Chen, editors,*Advances in Neural Information Processing Systems*, volume 38, pages 30322–30349\. Curran Associates, Inc\., 2025\.URL[https://proceedings\.neurips\.cc/paper\_files/paper/2025/file/2b5c5689fae6fa9a4883e73e511d52c8\-Paper\-Conference\.pdf](https://proceedings.neurips.cc/paper_files/paper/2025/file/2b5c5689fae6fa9a4883e73e511d52c8-Paper-Conference.pdf)\.
- Zhu et al\. \[2024\]Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Gong, and Xing Xie\.Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts\.In*Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis*, LAMPS ’24, page 57–68, New York, NY, USA, 2024\. Association for Computing Machinery\.ISBN 9798400712098\.URL[https://arxiv\.org/abs/2306\.04528](https://arxiv.org/abs/2306.04528)\.
- Zhuo et al\. \[2024\]Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen\.Prosa: Assessing and understanding the prompt sensitivity of llms\.In*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 1950–1976, 2024\.URL[https://arxiv\.org/abs/2410\.12405](https://arxiv.org/abs/2410.12405)\.
- Zouhar et al\. \[2023\]Vilém Zouhar, Clara Meister, Juan Gastaldi, Li Du, Mrinmaya Sachan, and Ryan Cotterell\.Tokenization and the noiseless channel\.In Anna Rogers, Jordan Boyd\-Graber, and Naoaki Okazaki, editors,*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 5184–5207, Toronto, Canada, 2023\. Association for Computational Linguistics\.URL[https://aclanthology\.org/2023\.acl\-long\.284/](https://aclanthology.org/2023.acl-long.284/)\.
## Appendix A Experimental Details
This section records the sampling and scoring conventions used to keep canonical, retokenized, and typo\-based comparisons aligned across datasets\.
### A.1 Stochastic Retokenization
To generate a non-canonical tokenization $E^{\mu}(s)$ of a string $s$, we first generate the canonical tokenization $E^{0}(s)$ using the model tokenizer. Each resulting canonical token is then independently selected for re-segmentation with probability $p_{retok}$. Re-segmenting a token means replacing it with a random sequence of shorter tokens, all present in the tokenizer vocabulary, that concatenate to the original token. To do this, we use the algorithm of \[Zheng et al., [2025](https://arxiv.org/html/2606.15521#bib.bib58)\], which samples uniformly from all valid ways of segmenting the original token. It treats token segmentation as a recursive decomposition problem and samples paths in the segmentation tree with probabilities proportional to the number of valid completions, ensuring that each valid segmentation is equally likely. Tokens not selected for re-segmentation (probability $1-p_{retok}$) are left unchanged. The final token sequence is formed by concatenating the (possibly re-segmented) tokens in order. By tuning $p_{retok}$, we smoothly interpolate between fully deterministic canonical tokenization ($p_{retok}=0$) and stochastic re-segmentation applied to every token ($p_{retok}=1$).
We apply this procedure only to task content tokens\. Special tokens, chat\-template markers, and other control symbols remain in canonical form\. This ensures that any measured effect is attributable to segmentation of the user\-visible task text rather than corruption of the prompt wrapper\.
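The counting-based sampling described above can be sketched as follows. This is a minimal illustration, not the implementation of Zheng et al.: the names `suffix_counts`, `sample_segmentation`, and `retokenize` are hypothetical, the vocabulary is a toy set of strings, and the sketch works on characters rather than bytes. It also assumes that the unsplit token itself is excluded from the candidate segmentations (controlled by `allow_identity`), which the text suggests but does not state outright.

```python
import random


def suffix_counts(token, vocab, allow_identity=False):
    """count[i] = number of valid segmentations of token[i:] into vocab pieces."""
    n = len(token)
    count = [0] * (n + 1)
    count[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            is_identity = (i == 0 and j == n)
            if token[i:j] in vocab and (allow_identity or not is_identity):
                count[i] += count[j]
    return count


def sample_segmentation(token, vocab, rng, allow_identity=False):
    """Draw one segmentation uniformly at random from all valid ones."""
    n = len(token)
    count = suffix_counts(token, vocab, allow_identity)
    if count[0] == 0:
        return [token]  # no alternative segmentation exists; keep the token
    pieces, i = [], 0
    while i < n:
        # Candidate end positions, weighted by the number of ways to finish.
        ends = [
            j for j in range(i + 1, n + 1)
            if token[i:j] in vocab
            and count[j] > 0
            and (allow_identity or not (i == 0 and j == n))
        ]
        weights = [count[j] for j in ends]
        j = rng.choices(ends, weights=weights, k=1)[0]
        pieces.append(token[i:j])
        i = j
    return pieces


def retokenize(tokens, vocab, p_retok, rng):
    """Re-segment each canonical token independently with probability p_retok."""
    out = []
    for tok in tokens:
        if rng.random() < p_retok:
            out.extend(sample_segmentation(tok, vocab, rng))
        else:
            out.append(tok)
    return out
```

Weighting each first piece by the number of completions of the remaining suffix is what makes the draw uniform over whole segmentations rather than over first pieces. In a real pipeline the special tokens and chat-template markers would be filtered out before calling `retokenize`, as described above.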
### A.2 Typographical Perturbations
As an input\-side baseline, we also evaluate typo\-based perturbations\. These runs are constructed analogously to the retokenization runs: for each problem we create multiple perturbed copies of the prompt, vary a corruption probability, and decode from each perturbed prompt using the same model and generation settings as in the corresponding retokenization experiment\. The goal is not to study a particular typo taxonomy, but to provide a comparison class of small surface\-form perturbations that change the visible text rather than only its segmentation\.
For typo-based perturbations, we vary a corruption probability parameter $p_{\mathrm{typo}}$ analogously to the retokenization rate $p_{retok}$. Given a prompt, we first identify the eligible tokens in the perturbable region, namely tokens whose decoded text contains at least one mutable alphabetic character. We then draw the number of token-level typo operations from a binomial distribution with parameter $p_{\mathrm{typo}}$, sample that many eligible tokens, and apply a random local typo within each selected token by replacing one randomly chosen mutable character with a keyboard-adjacent substitute or a doubled-letter variant (e.g., a $\rightarrow$ q, w, s, z, x, aa; no insertion, deletion, or transposition operators are used). Thus, increasing $p_{\mathrm{typo}}$ increases the expected number of perturbed tokens in the prompt, while keeping the perturbation local and sparse at small $p_{\mathrm{typo}}$. This makes $p_{\mathrm{typo}}$ the typo analogue of $p_{retok}$: both parameters interpolate between the unmodified prompt and progressively more perturbed variants, but in the typo case the perturbation changes the visible surface form rather than only its segmentation.
The scope of the perturbation is dataset dependent\. For MMLU and standard GSM8K, typos are injected only into the natural\-language problem text, leaving system prompts, answer\-format markers, and other control tokens unchanged\. For HumanEval and GSM8K Python, we apply typos only within the docstring region of the prompt\. We do not perturb the function signature, indentation, or surrounding Python scaffold\. This restriction is important: it keeps the task well\-formed as a code\-completion problem and ensures that any observed changes are attributable to altered natural\-language instructions rather than to syntactic corruption of the program context\.
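A minimal sketch of the typo operator described above, under stated assumptions: the adjacency rule (keys within one column in the same or neighbouring QWERTY rows) is a guess that happens to reproduce the a → q, w, s, z, x example, "mutable" is taken to mean an ASCII letter, and the binomial draw uses the number of eligible tokens as its trial count. The names `neighbors`, `typo_token`, and `perturb` are hypothetical.

```python
import random

# Assumed QWERTY layout; adjacency = same or neighbouring row, column offset <= 1.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def neighbors(ch):
    """Keyboard-adjacent lowercase letters for a lowercase letter `ch`."""
    for r, row in enumerate(ROWS):
        c = row.find(ch)
        if c >= 0:
            out = []
            for rr in (r - 1, r, r + 1):
                if 0 <= rr < len(ROWS):
                    for cc in (c - 1, c, c + 1):
                        if 0 <= cc < len(ROWS[rr]) and (rr, cc) != (r, c):
                            out.append(ROWS[rr][cc])
            return out
    return []


def is_mutable(ch):
    # Assumption: only ASCII letters may be perturbed.
    return ch.isascii() and ch.isalpha()


def typo_token(tok, rng):
    """Replace one mutable character by a neighbour or a doubled letter."""
    idxs = [i for i, ch in enumerate(tok) if is_mutable(ch)]
    i = rng.choice(idxs)
    ch = tok[i]
    options = neighbors(ch.lower()) + [ch.lower() * 2]
    sub = rng.choice(options)
    if ch.isupper():
        sub = sub.upper()
    return tok[:i] + sub + tok[i + 1:]


def perturb(tokens, p_typo, rng):
    """Apply typos to a Binomial(#eligible, p_typo) number of eligible tokens."""
    eligible = [i for i, t in enumerate(tokens) if any(is_mutable(c) for c in t)]
    m = sum(rng.random() < p_typo for _ in eligible)  # binomial draw
    out = list(tokens)
    for i in rng.sample(eligible, m):
        out[i] = typo_token(out[i], rng)
    return out
```

To respect the dataset-dependent scope, `tokens` would hold only the perturbable region (the problem text, or the docstring for the code benchmarks), with the surrounding scaffold concatenated back unchanged afterwards.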
### A.3 Calculating pass@ metrics
Each question $i$ from a dataset of size $N$ is replicated $r=30$ times for a fixed $p_{retok}$, and stochastic tokenization is applied independently to each replicate by re-segmenting canonical tokens with probability $p_{retok}$. For $p_{retok}=0$, this replication is skipped, since every replicate would be the canonical tokenization. We sweep over 6 values, $p_{retok}\in\{0.0,0.2,0.4,0.6,0.8,1.0\}$, obtaining $R=r\times 5+1=151$ tokenizations for each prompt. The resulting tokenized inputs are then passed to the model, and inference is performed using greedy decoding (i.e., no output sampling). Thus, for each question $i$, we record how many generations (out of $R$) produce the correct answer. This defines $\mathrm{pass@retok}$, directly analogous to $\mathrm{pass@k}$, except that diversity arises from stochastic tokenization rather than output sampling. All experiments are run with the same model, prompt formatting (specific to the benchmark dataset), and decoding parameters, differing only in the retokenization probability. We report $\mathrm{pass@retok}$ as a function of the number of re-tokenized trials. To calculate $\mathrm{pass@k}$, we use top-p sampling ($p=0.9$, temperature $=1$) to generate $R$ generations per question $i$.
For computing $\mathrm{pass@retok}$, we apply the same unbiased estimator as for $\mathrm{pass@k}$ (Appendix B, Eq. 7). In this case, $n$ is the number of retokenizations, including the canonical tokenization, and $c$ is the number of these that generate the correct answer under greedy decoding.
We use dataset-specific prompt formatting and answer parsing, as described in the sections below.
### A.4 Knowledge Retrieval
Massive Multitask Language Understanding (MMLU) \[Hendrycks et al., [2021](https://arxiv.org/html/2606.15521#bib.bib20)\] is a standard evaluation benchmark comprising multiple-choice questions from a wide variety of subjects. In our experiments we use a subset of 1000 questions sampled randomly from the MMLU dataset. We format each question as a zero-shot multiple-choice prompt, with the model required to output one of the possible answer choices $\{A,B,C,D\}$. Given an input like this,
<im\_start\>system You are a helpful assistant\. For the following question, return the answer only, without any additional reasoning or explanation\. <im\_end\><im\_start\>user Question: $(\mathbb{Z},*)$ is a group with $a*b=a+b+1$ for all $a,b\in\mathbb{Z}$\. The inverse of $a$ is A\. 0 B\. \-2 C\. $a-2$ D\. $(2+a)*-1$ Answer:<im\_end\><im\_start\>assistant
we capture the first token generated by the model. If the generated token is not in $\{A,B,C,D\}$, we consider the output incorrect; otherwise we compare the generated answer with the true answer option. To calculate $\mathrm{pass@retok}$, we re-tokenize only the tokens that represent the question and its options in the above prompt, leaving the system prompt and special tokens (such as <im\_start\>user) in their canonical form.
Because MMLU is multiple choice with only four valid outputs, naively resampling the answer token can make pass@k-style metrics appear artificially strong even when the underlying reasoning has not improved. We therefore score MMLU from the first answer token only and use the canonical next-token answer probabilities for the temperature baseline discussed above. The random-sampling plot in Figure [15](https://arxiv.org/html/2606.15521#A4.F15) illustrates why this convention is necessary.
### A.5 Mathematical Reasoning
GSM8K \[Cobbe et al., [2021](https://arxiv.org/html/2606.15521#bib.bib9)\] is a standard benchmark for evaluating mathematical reasoning on grade-school–level word problems. It consists of free-response questions that require computing a numerical answer rather than selecting from predefined options. In our experiments, we use a randomly sampled subset of 1000 questions from the GSM8K dataset. We do not apply any chat template to the questions as done in the previous datasets.
<pad\><im\_start\>system You are a helpful assistant\. <im\_end\><im\_start\>user Question: Given a 7\-day week, how much does Alex charge for 2 weeks of tutoring if she charges $12 per day? Answer: <im\_end\><im\_start\>assistant
Given this input, we generate a single completion per prompt and extract the model’s predicted numeric answer from the generated text using a regular expression that selects the final number produced\. The prediction is considered correct if this value exactly matches the ground\-truth answer\.
This exact\-match criterion is intentionally strict\. It penalizes both arithmetic mistakes and reasoning traces that end with an incorrect final scalar, making GSM8K a useful middle ground between the rigid answer space of MMLU and the open\-ended program space of HumanEval\.
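The final-number extraction and exact-match check can be sketched as below. The regular expression and the helper names `extract_final_number` and `is_correct` are illustrative assumptions; the paper states only that a regular expression selects the final number produced. Comparing as floats after stripping thousands separators is likewise an assumption about how "exactly matches" is applied.

```python
import re

# Assumed pattern: optional sign, digits with optional thousands commas,
# optional decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in `text` as a string without commas, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(generation, gold):
    """Exact match between the final number generated and the gold answer."""
    pred = extract_final_number(generation)
    return pred is not None and float(pred) == float(gold)
```

For the tutoring example above, a generation ending in "14 days * $12 = $168." yields the prediction `"168"`, which matches the gold answer.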
### A.6 GSM8K Python
Inspired by previous work \[Chowdhery et al., [2023](https://arxiv.org/html/2606.15521#bib.bib8)\], we also evaluate a code-mediated variant of GSM8K in which each word problem is converted into a Python function-completion task. Rather than asking the model to state the final numeric answer in natural language, we prompt it to write the body of a function whose return value should equal the correct answer. This benchmark sits between GSM8K and HumanEval: the underlying reasoning problem is mathematical, but the output space is programmatic.
Starting from the same GSM8K test questions, we build 1000 problem instances by sampling with a fixed random seed\. Each prompt has the form
def function\(\): """Given a 7\-day week, how much does Alex charge for 2 weeks of tutoring if she charges $12 per day?"""
where the natural-language question is placed inside the docstring of a Python function. The model must complete the function body. After generation, we extract the code corresponding to `function` and evaluate it using an automatically constructed test harness. The harness executes the generated function, checks that it returns a finite numeric value, and compares that value against the gold GSM8K answer using `math.isclose` with absolute tolerance $10^{-6}$. As in HumanEval (Appendix [A.7](https://arxiv.org/html/2606.15521#A1.SS7)), correctness is therefore functional rather than string-based.
This dataset is useful for our purposes because it separates two roles that are entangled in standard GSM8K\. The question itself remains natural language, so retokenization still acts on ordinary text, but the answer must be realized as executable code\. This makes GSM8K Python a cleaner bridge between free\-response math and full code generation, and it is the benchmark on which our syntactic\-diversity analysis can be extended beyond HumanEval\. In the typo setting, perturbations are applied only to the docstring region so that the surrounding Python scaffold remains syntactically well\-formed\.
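A minimal sketch of such a harness is shown below. The name `check_completion` is hypothetical, the relative tolerance is left at the `math.isclose` default because the text specifies only the absolute tolerance, and the code is executed without sandboxing or timeouts, which a real evaluation of model-generated code would need.

```python
import math


def check_completion(code, gold, func_name="function"):
    """Run generated code and compare the function's return value to `gold`."""
    namespace = {}
    try:
        # WARNING: illustrative only; untrusted code should run in a sandbox.
        exec(code, namespace)
        value = namespace[func_name]()
        # Require a real number (bool is excluded although it subclasses int).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not math.isfinite(value):
            return False
        return math.isclose(value, gold, abs_tol=1e-6)
    except Exception:
        # Syntax errors, runtime errors, or a missing function count as failures.
        return False
```

Usage on the tutoring prompt: a completion whose body is `return 2 * 7 * 12` passes against the gold answer 168, while a completion that raises or returns a string fails.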
### A.7 Code Generation
HumanEval \[Chen et al., [2021](https://arxiv.org/html/2606.15521#bib.bib7)\] is a standard benchmark for evaluating code generation and functional correctness on small programming tasks. It consists of 164 Python problems, each with a function signature, a natural-language docstring, and hidden unit tests. In our experiments, we evaluate on the full set of 164 problems.
def generate\_integers\(a, b\): """ Given two positive integers a and b, return the even digits between a and b, in ascending order\.
For example: generate\_integers\(2, 8\) =\> \[2, 4, 6, 8\] generate\_integers\(8, 2\) =\> \[2, 4, 6, 8\] generate\_integers\(10, 14\) =\> \[\] """
Given each prompt, we generate a single completion and extract the code for the target function\. The completion is considered correct if the generated function passes all provided unit tests \(i\.e\., functional correctness at pass@1\)\.
HumanEval is the domain where residual segmentation sensitivity is most behaviorally useful in our experiments\. Unlike MMLU, the output space is enormous, and unlike GSM8K, there are many structurally distinct correct solutions\. This makes it plausible for semantically equivalent prompt segmentations to steer the model toward different algorithmic families while still preserving task identity\.
## Appendix B Benchmark Aggregation and Error Bars
The unbiased estimator for pass-at-k, which we refer to here as $\mathrm{pass}(k)$, assumes we uniformly sample from an existing set of $n$ completions, $c$ of which are correct. Then

$$\mathrm{pass}(k)=1-\frac{C(n-c,k)}{C(n,k)}, \tag{7}$$

where $C(n,k)=\binom{n}{k}$ is the binomial coefficient. For the plotted pass curves, this estimator is applied at the level of individual benchmark tasks rather than after pooling all completions across the benchmark. Concretely, for task $i$ we collect $n_i$ completions and let $c_i$ denote the number of correct ones. The task-level empirical pass rate at budget $k$ is

$$\mathrm{pass}_i(k)=1-\frac{C(n_i-c_i,k)}{C(n_i,k)}. \tag{8}$$

The reported benchmark value is then the mean of these task-level quantities over the $N$ tasks in the benchmark:

$$\mathrm{pass}_{\mathcal{D}}(k)=\frac{1}{N}\sum_{i=1}^{N}\mathrm{pass}_i(k). \tag{9}$$

Error bars are standard errors over tasks. If

$$s_k^2=\frac{1}{N-1}\sum_{i=1}^{N}\left(\mathrm{pass}_i(k)-\mathrm{pass}_{\mathcal{D}}(k)\right)^2 \tag{10}$$

is the sample variance of the task-level pass estimates at a fixed $k$, then the plotted standard error is

$$\mathrm{SE}_k=\frac{s_k}{\sqrt{N}}. \tag{11}$$

Thus the uncertainty shown in the figures reflects variation across benchmark problems, not variation across repeated generations for a single problem.
The same aggregation logic is used for $\mathrm{pass@retok}(k)$ and typo-based curves: for each task we first compute the success probability implied by that task’s set of variants, and we then average those per-task values across the benchmark. For MMLU temperature curves, where the relevant baseline is the canonical next-token answer probability rather than sampled completions, the task-level quantity is computed from that answer probability and then averaged across tasks in the same way.
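The per-task estimator and the benchmark-level mean and standard error defined in this appendix can be written compactly as follows. This is a sketch under the definitions above; the function names `pass_at_k` and `benchmark_pass` are illustrative, and the per-task input is assumed to be a list of $(n_i, c_i)$ pairs.

```python
from math import comb, sqrt


def pass_at_k(n, c, k):
    """Unbiased per-task estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate is 1 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass(counts, k):
    """Mean and standard error over tasks of the per-task pass estimates.

    counts: list of (n_i, c_i) pairs, one per task (at least two tasks).
    """
    vals = [pass_at_k(n, c, k) for n, c in counts]
    num_tasks = len(vals)
    mean = sum(vals) / num_tasks
    var = sum((v - mean) ** 2 for v in vals) / (num_tasks - 1)  # sample variance
    return mean, sqrt(var / num_tasks)  # s_k / sqrt(N)
```

Because the estimate is computed per task before averaging, a task with all variants correct contributes exactly 1 at every $k$, and a task with none correct contributes exactly 0, regardless of how many completions other tasks have.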
## Appendix C Estimating Compute Cost in Retokenization Sampling
The usual $\mathrm{pass@k}$ estimator treats each draw as equally expensive in prompt tokens, which is reasonable as sampling occurs only from the output distribution. For Retokenization Sampling, however, the input itself changes length because non-canonical segmentations often use more tokens than the canonical prompt. A compute-aware comparison between sampling methods therefore needs both an accuracy estimator and a prompt-budget model.
To estimate the prompt token cost for task $i$ in a benchmark, let $L_{c,i}$ be the average length of prompts that generate correct completions and $L_{w,i}$ the average length of prompts that generate incorrect completions. For a fixed sample size $k$, the expected total length is

$$L_i(k)=\sum_{q=1}^{k}p_q^i\left(qL_{c,i}+(k-q)L_{w,i}\right), \tag{12}$$

where $p_q^i$ is the probability of drawing exactly $q$ correct completions out of $k$ attempts, given by the hypergeometric distribution

$$p_q^i=\frac{\binom{c_i}{q}\binom{n-c_i}{k-q}}{\binom{n}{k}}, \tag{13}$$

where $c_i$ is the number of correct completions among $n$ samples. It is possible to express this in terms of $\mathrm{pass@k}$:

$$L_i(k)=kL_{w,i}\,\mathrm{pass}_i(k)+\left(L_{c,i}-L_{w,i}\right)f_i k, \tag{14}$$

where we used the definition $\mathrm{pass}_i(k)=\sum_{q=1}^{k}p_q^i$ together with the hypergeometric mean $\sum_{q=1}^{k}q\,p_q^i=kc_i/n$, and have introduced the total fraction of passing solutions $f_i=c_i/n$. The benchmark-averaged expected length is

$$\bar{L}(k)/k=\mathbb{E}_{\mathcal{D}}\left[L_{w,i}\,\mathrm{pass}_i(k)\right]+\mathbb{E}_{\mathcal{D}}\left[\left(L_{c,i}-L_{w,i}\right)f_i\right], \tag{15}$$

which, assuming independence between the problem-dependent variables, gives the approximation

$$\bar{L}(k)/k\approx\bar{L}_w\,\mathrm{pass@k}+\left(\bar{L}_c-\bar{L}_w\right)\bar{f}. \tag{16}$$

In practice, we estimate prompt-side cost directly from the recovered tokenized prompts for each variant family and then plot pass rate against expected prompt tokens, as in the HumanEval comparison in the main text (Figure [6](https://arxiv.org/html/2606.15521#S3.F6)). This does not capture every source of wall-clock variation, but it is the relevant first-order correction for retokenization because the method changes how much computation is spent on the input context before any completion tokens are generated.
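As a numerical sanity check on the step from the hypergeometric sum (Eq. 12) to the closed form (Eq. 14), the sketch below evaluates both for a single task. The function names and the example values of $n$, $c_i$, $k$, $L_{c,i}$, and $L_{w,i}$ are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_pmf(q, n, c, k):
    """P(exactly q correct among k draws without replacement), as in Eq. 13."""
    return comb(c, q) * comb(n - c, k - q) / comb(n, k)


def expected_length_direct(n, c, k, len_correct, len_wrong):
    """Eq. 12: sum over q = 1..k of p_q * (q * L_c + (k - q) * L_w)."""
    return sum(
        hypergeom_pmf(q, n, c, k) * (q * len_correct + (k - q) * len_wrong)
        for q in range(1, k + 1)
    )


def expected_length_closed(n, c, k, len_correct, len_wrong):
    """Eq. 14: k * L_w * pass(k) + (L_c - L_w) * f * k, with f = c / n."""
    pass_k = 1.0 - comb(n - c, k) / comb(n, k)
    frac_correct = c / n
    return k * len_wrong * pass_k + (len_correct - len_wrong) * frac_correct * k
```

The two functions agree because the sum of $p_q^i$ over $q\geq 1$ is $\mathrm{pass}_i(k)$ and the sum of $q\,p_q^i$ is the hypergeometric mean $kc_i/n$. Note that, as written in Eq. 12, the sum starts at $q=1$, so draws with no correct completion do not contribute.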
Figure 7: Prompt-budget frontier on HumanEval. The horizontal axis shows benchmark pass rate and the vertical axis shows average prompt-token cost per sample, $\bar{L}(k)/k$, for three sampling procedures: canonical temperature sampling ($\mathrm{pass@k}$, green), retokenization sampling ($\mathrm{pass@retok}$, red), and typo sampling ($\mathrm{pass@typo}$, blue). Each point corresponds to a different sample budget $k$. Retokenization traces the steepest frontier, indicating that it requires the most prompt tokens per unit gain in pass rate, while canonical temperature sampling remains the most prompt-efficient and typo sampling lies in between.
## Appendix D Additional Results
### D.1 Retokenized context helps with next-token prediction
We probed the emergence of segmentation symmetry in several ways. In this section, we show that a trained model is able to recover useful information from retokenized text and use it for prediction. In particular, we study whether a model can use a non-canonical segmentation of a passage to improve prediction on the canonical tokenization of the same passage. Following the setup of Sivan and Tsodyks \[[2025](https://arxiv.org/html/2606.15521#bib.bib49)\] for measuring meaningful information rate in communication, we measure the cumulative information content, or surprisal, of a canonically tokenized natural language passage under three conditions: (1) no preceding context, (2) an identical retokenized version of the passage in context, and (3) an identical canonically tokenized version of the passage in context. For conditions (2) and (3), we use the following prompt:
The following texts, separated by \-\-\-\-, are identical\. $E'$ \-\-\-\- $E^{0}$
Here, $E^{0}$ is the canonical tokenization of the passage, and $E'$ is either a retokenization (condition (2)) or the canonical tokenization (condition (3)).
To define our measurement mathematically, let $E^{0}(s)=(e_1,\ldots,e_T)$ be the canonical token sequence for passage $s$, and $W$ be the prefix corresponding to one of the three conditions listed above. The information content of each token in the canonical sequence is given by the negative logarithm of the probability of that token, conditioned on all past tokens:

$$I_t(W)=-\log P\left(e_t\mid e_{<t},W\right). \tag{17}$$

This is often also called the surprisal, since a low value indicates a predictable token, while a high value indicates a low-probability and hence "surprising" token. We measure the cumulative information content, which is the cumulative sum of the information content or surprisal over the token sequence of interest, conditioned on the test condition $W$:

$$H_t(W)=\sum_{s=1}^{t}I_s(W). \tag{18}$$

From this, we get the entropy rate $h(W)=H_T(W)/T$. The results for OLMo-2-7B are shown in Fig. [8](https://arxiv.org/html/2606.15521#A4.F8). If the model understands that the retokenized prefix carries the same content as the canonical sequence, then the surprisal of the canonical continuation should fall sharply relative to the no-context baseline. This is exactly what we observe: after seeing the retokenized text, the information content of the canonical sequence drops substantially, reducing the entropy rate from roughly 2.5 nats/token, typical of natural language, to about 0.06 nats/token. We also track this entropy-rate gap across OLMo-2 pretraining checkpoints \[OLMo et al., [2025](https://arxiv.org/html/2606.15521#bib.bib32)\], which lets us follow the emergence of this compositional understanding over training.
One possibility is that after seeing a retokenized passage, the only entropy that remains is the entropy due to segmentation. If every segmentation were uniformly weighted, the entropy rate would be upper bounded by approximately $\ln 2$ ($\approx 0.69$) nats/token, which is well above the observed residual entropy of 0.06 nats/token. However, it is more likely that the distribution over segmentations is nonuniform, and that the distribution is strongly biased toward the canonical segmentation.
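The quantities in Eqs. 17 and 18 are straightforward to compute once per-token log-probabilities of the canonical sequence are available. The sketch below assumes they are given as a list of natural-log probabilities $\log P(e_t\mid e_{<t},W)$; how they are obtained from a model is not shown. The helper names are hypothetical, and the definition of fractional entropy reduction as $(h_{\text{no context}}-h_{\text{context}})/h_{\text{no context}}$ is an assumption, since the text does not spell it out.

```python
import math


def cumulative_information(token_logprobs):
    """H_t for t = 1..T, where I_t = -log P(e_t | e_<t, W) in nats."""
    cumulative, total = [], 0.0
    for logprob in token_logprobs:
        total += -logprob  # surprisal of this token
        cumulative.append(total)
    return cumulative


def entropy_rate(token_logprobs):
    """h(W) = H_T(W) / T in nats per token."""
    return cumulative_information(token_logprobs)[-1] / len(token_logprobs)


def fractional_reduction(rate_no_context, rate_with_context):
    """Assumed definition: share of the no-context entropy rate removed."""
    return 1.0 - rate_with_context / rate_no_context
```

With the rates quoted above, `fractional_reduction(2.5, 0.06)` evaluates to about 0.976, meaning that under this assumed definition the retokenized prefix removes nearly all of the no-context entropy rate.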
Figure 8: Retokenized context sharply reduces surprisal, and this ability emerges during training. Left: Cumulative information content $H_t$ (Eq. [18](https://arxiv.org/html/2606.15521#A4.E18)) of a canonically tokenized passage under three conditions: no context (blue), identical retokenized context (red), and identical canonically tokenized context (green), for OLMo-2-7B Instruct. A retokenized prefix greatly lowers the surprisal of the canonical continuation, approaching the canonical-copy baseline. Middle: Across OLMo-2-7B pretraining checkpoints, the mean entropy rate of canonical text remains high irrespective of the context. However, while the no context condition plateaus, the retokenized context continues to decrease. Right: The corresponding fractional entropy reduction increases over training, showing the emergence of segmentation-level compositional understanding. Data is computed on 50 passages from the raw variant of Wikitext-103 Merity et al. \[[2017](https://arxiv.org/html/2606.15521#bib.bib30)\], using 20 retokenizations per passage and $p_{retok}=1.0$.
### D.2 $P_{fail}$ distribution across datasets
Figure [2](https://arxiv.org/html/2606.15521#S3.F2) uses aggregate pass curves and $\Delta P$ histograms to show that retokenization affects some problems much more than others. Figure [9](https://arxiv.org/html/2606.15521#A4.F9) makes that heterogeneity explicit by plotting the full distribution of empirical per-problem failure probabilities ($P_{fail}$) under canonical output sampling and retokenization sampling. Across datasets, the shape of this distribution reveals whether a benchmark is dominated by easy problems, hard problems, or a broad middle regime. HumanEval and GSM8K Python show the clearest right-shift under retokenization, indicating that many tasks become less reliable when the same prompt is presented through non-canonical segmentations. GSM8K exhibits a broader spread, while MMLU remains strongly bimodal because many questions are either almost always solved or almost always failed. These distributions therefore complement the pass curves by showing not only whether retokenization lowers average performance, but how that loss is distributed across benchmark instances.
Figure 9: Per-problem failure-probability distributions under canonical and retokenized sampling. For OLMo-2-1124-7B-Instruct, each panel shows the distribution over benchmark problems of $P(\mathrm{failure})$ estimated from canonical output sampling ($\mathrm{pass@k}$, blue) and retokenization sampling ($\mathrm{pass@retok}$, red). HumanEval and GSM8K Python show the clearest right-shift under retokenization, indicating that many problems become harder when the same prompt is presented through non-canonical segmentations. GSM8K exhibits a broader spread under retokenization than under canonical sampling, while MMLU remains strongly bimodal because many questions are either almost always solved or almost always failed.
### D.3 Retokenization, Typos, and Temperature Sampling
Figure [10](https://arxiv.org/html/2606.15521#A4.F10) overlays the pass curves for canonical output sampling, retokenization sampling, and typo sampling across four benchmarks: HumanEval, GSM8K, GSM8K Python, and MMLU. This makes the broad pattern easy to see: on the open-ended tasks, conventional temperature sampling is strongest, typo perturbations usually provide a stronger input-side baseline than retokenization, and retokenization remains the weakest but most controlled search axis. MMLU again stands apart because repeated sampling behaves differently in its small multiple-choice answer space.
Figure 10: All pass curves overlaid across datasets and perturbation families. Colors denote benchmarks: HumanEval (blue), GSM8K (orange), MMLU (green), and GSM8K Python (red). Line styles denote sampling method: dashed for $\mathrm{pass@k}$, solid for $\mathrm{pass@retok}$, and dotted for $\mathrm{pass@typo}$. Across HumanEval, GSM8K, and GSM8K Python, conventional output-side sampling is strongest, typo sampling is usually intermediate, and retokenization is weaker but remains competitive. On the right, we plot the fraction of hard problems $\mathbb{E}\left[q>0.9\right]_{q\sim P_{fail}}$ against the fraction of easy problems $\mathbb{E}\left[q<0.1\right]_{q\sim P_{fail}}$.

Figure [10](https://arxiv.org/html/2606.15521#A4.F10) provides a compact cross-dataset summary of the ordering between the three sampling procedures. On HumanEval, GSM8K and GSM8K Python, the main pattern is consistent: temperature sampling yields the highest pass curves, typo perturbations provide a stronger input-side baseline than retokenization, and retokenization remains the cleanest but weakest search axis. For MMLU, the limited four-choice answer space makes repeated sampling behavior qualitatively different. This overlay therefore reinforces the main text claim that retokenization is best understood as a controlled and complementary source of diversity rather than a universal replacement for conventional sampling.
### D.4 OLMo-2 Pass Rates on HumanEval at Different Checkpoints and Model Sizes
Figure 11: Retokenization robustness across OLMo-2 checkpoints and model sizes on HumanEval. Left: within the OLMo-2-1124-7B pretraining run, later checkpoints improve both canonical $\mathrm{pass@k}$ (solid) and retokenized $\mathrm{pass@retok}$ (dashed), indicating that robustness to equivalent segmentations increases with training progress. Right: across OLMo-2 Instruct model sizes, larger models achieve higher pass curves under both sampling procedures, while the gap between canonical and retokenized performance remains model dependent.
### D.5 OLMo-2 Pass Rates Across Retokenization Probabilities $p_{\mathrm{retok}}$
Figure [12](https://arxiv.org/html/2606.15521#A4.F12) separates the pooled $\mathrm{pass@retok}$ results by retokenization rate $p_{\mathrm{retok}}$ on HumanEval. This makes the symmetry–sensitivity tradeoff more explicit. Smaller perturbation rates preserve more of the canonical prompt structure and therefore retain higher task performance, while larger rates introduce more aggressive segmentation changes and push more problems toward failure. The left panel shows this as a family of pass curves, where moderate values of $p_{\mathrm{retok}}$ remain competitive over a broad range of $k$ and extreme retokenization is consistently weaker. The right panel shows the same pattern in distributional form: as $p_{\mathrm{retok}}$ increases, the per-task failure-probability histogram shifts toward larger values and the mean failure probability rises.
Figure 12: HumanEval retokenization curves across perturbation rates. Left: $\mathrm{pass@retok}$ curves for HumanEval at different retokenization rates $p_{\mathrm{retok}}$, compared against canonical $\mathrm{pass@k}$. Moderate retokenization rates preserve more task performance than aggressive retokenization, while still producing diversity. Right: histograms of per-task failure probabilities for the same retokenization rates, showing that increasing $p_{\mathrm{retok}}$ shifts more tasks toward high failure probability. Dashed lines show mean $P_{fail}$.
### D\.6Segmentation robustness across different model families
Across model families on HumanEval, stronger models also tend to remain stronger under equivalent segmentations \(Figure[13](https://arxiv.org/html/2606.15521#A4.F13)\)\. Qwen3\-8B\[Yang et al\.,[2025](https://arxiv.org/html/2606.15521#bib.bib55)\]and Llama\-3\.1\-8B\-Instruct\[Grattafiori et al\.,[2024](https://arxiv.org/html/2606.15521#bib.bib18)\]sustain the highestpass@retok\(k\)\\mathrm\{pass@retok\}\(k\)curves, while weaker models remain low under bothpass@k\\mathrm\{pass@k\}andpass@retok\(k\)\\mathrm\{pass@retok\}\(k\)\. At the same time, the gap between canonical sampling and retokenization is clearly model dependent\. Qwen3\-8B shows the smallest gap, withpass@retok\(k\)\\mathrm\{pass@retok\}\(k\)remaining very close to canonicalpass@k\\mathrm\{pass@k\}across the full range ofkk, whereas Gemma\[Team et al\.,[2025](https://arxiv.org/html/2606.15521#bib.bib51)\]and Pythia\[Biderman et al\.,[2023](https://arxiv.org/html/2606.15521#bib.bib3)\]models exhibit substantially larger separations\. This suggests that segmentation robustness broadly tracks overall capability, but the extent to which equivalent segmentations preserve useful search behavior is also shaped by model\-family\-specific inductive biases\.
Figure 13: Retokenization robustness across model families on HumanEval. Solid curves denote $\mathrm{pass@k}$ and dashed curves denote $\mathrm{pass@retok}$. Stronger models tend to remain stronger under equivalent segmentations, with Qwen3-8B and Llama-3.1-8B-Instruct sustaining the highest retokenized pass curves.
### D.7 Behavioral Plots
The main text focuses on the aggregate shape of the pass curves. The two figures below isolate two points that are easy to miss in the main presentation: first, how the gap between $\mathrm{pass@k}$ and $\mathrm{pass@retok}$ differs by dataset, and second, why MMLU requires special care when interpreted through a pass@k-style lens.
Figure 14: Dataset-specific gap between temperature and retokenization sampling. The plotted quantity is $\mathrm{pass@k} - \mathrm{pass@retok}$ as a function of $k$. HumanEval shows the largest persistent advantage for temperature sampling, GSM8K a smaller advantage that gradually shrinks, and MMLU a reversal driven by the peculiar saturation behavior of repeated sampling in a four-choice answer space.

Figure [14](https://arxiv.org/html/2606.15521#A4.F14) plots $\mathrm{pass@k} - \mathrm{pass@retok}$ as a function of $k$. HumanEval exhibits the largest persistent positive gap, GSM8K a smaller one that steadily shrinks, and MMLU a negative gap. The sign flip on MMLU is not evidence that retokenization is universally stronger there. Instead, it reflects the unusual geometry of repeated sampling in a four-choice answer space, where output-side resampling can saturate very quickly.
Figure 15: Why MMLU needs a special temperature baseline. The gray dashed curve shows the pass curve obtained by repeatedly sampling answer letters directly from the canonical next-token distribution, while the green curves compare standard $\mathrm{pass@k}$ and $\mathrm{pass@retok}$ on the same benchmark. Because the answer space is so small, repeated next-token sampling drives the pass rate toward 1 very quickly; this is why MMLU temperature comparisons should be based on the canonical answer probabilities rather than unconstrained multi-token generations.

Figure [15](https://arxiv.org/html/2606.15521#A4.F15) makes this saturation issue explicit. If we repeatedly sample answer letters from the model's canonical next-token distribution on MMLU, the induced pass curve approaches 1 extremely quickly. That curve is much stronger than either free-form sampling or retokenization, but it is also much less informative because it mainly reflects repeated draws from a tiny answer set rather than new reasoning trajectories. This is why our MMLU temperature baseline is defined from the canonical answer probabilities of the greedy $p=0$ run rather than from unconstrained multi-token generations.
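The speed of this saturation can be checked with a short calculation. The sketch below assumes that the $k$ answer letters are drawn independently from a fixed distribution that places probability `p_correct` on the right letter, so that the chance of at least one correct draw is $1 - (1 - p)^k$. The function name and the probability values are hypothetical and are not taken from the paper's measurements.

```python
def letter_sampling_pass_at_k(p_correct: float, k: int) -> float:
    """Probability that at least one of k i.i.d. answer-letter draws is correct."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    return 1.0 - (1.0 - p_correct) ** k


# Even a chance-level model (p = 0.25 on a four-choice question)
# approaches a pass rate of 1 as k grows.
for p in (0.25, 0.5):
    curve = [round(letter_sampling_pass_at_k(p, k), 3) for k in (1, 4, 16, 64)]
    print(p, curve)
```

Under this assumption a model that assigns only chance-level mass to the correct letter still exceeds a pass rate of 0.98 by $k = 16$, which illustrates why the resulting curve says little about new reasoning trajectories.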
### D.8 Generating Passing Retokenizations via Iterative BPE Merging
The subset of initial passing retokenizations on the HumanEval benchmark is a fraction of the total number of initial retokenizations: a total of 20 across all five problems. To enlarge this subset, we use these passing retokenizations to generate more. We begin with an initial passing retokenization for a given problem and iteratively apply one BPE merge of tokens from the tokenizer's merge list (thus ensuring that the generated tokens are present in the vocabulary). We then prompt the model with the new retokenization of the same problem and check whether it generates a passing solution under the HumanEval unit test suites [Chen et al., [2021](https://arxiv.org/html/2606.15521#bib.bib7)]. If the solution passes the tests, we accept the merge, which yields a new passing retokenization, and apply the next merge in the merge list. If the solution fails, we reject the merge, return to the retokenization before the merge, and attempt the next merge in the merge list. The process stops once we reach the canonical tokenization or run out of valid merges for the retokenization. No problem in our dataset ever reached the canonical tokenization. For problems with many initial retokenizations that generated passing solutions (HumanEval problems 26 and 159), we perform this process for each initial passing retokenization individually. Interestingly, this process worked very well for HumanEval problems 26, 84, and 115, generating, collectively, 459 new passing retokenizations, but it was not able to generate new passing retokenizations for the other problems. The results of this process are shown in Table [1](https://arxiv.org/html/2606.15521#A4.T1).
Table 1: Number of passing retokenizations before and after iterative BPE merge generation, out of 256 initial retokenizations per problem. "Generated passing" counts additional retokenizations produced by the iterative merge process starting from each initial passing retokenization.
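The accept/reject loop described in this subsection can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: tokens are plain strings, a merge joins every adjacent occurrence of its pair, the merge list is traversed once in order, and the `passes` callback is a hypothetical stand-in for prompting the model and running the HumanEval unit tests.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Tokens = Tuple[str, ...]
Merge = Tuple[str, str]


def apply_merge(tokens: Tokens, pair: Merge) -> Optional[Tokens]:
    """Join each adjacent occurrence of `pair`; return None if it never occurs."""
    out: List[str] = []
    i, changed = 0, False
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
            changed = True
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out) if changed else None


def grow_passing_retokenizations(
    start: Tokens,
    merges: Sequence[Merge],
    passes: Callable[[Tokens], bool],
    canonical: Tokens,
) -> List[Tokens]:
    """Walk the merge list once, keeping only merges whose result still passes."""
    current = start
    accepted: List[Tokens] = []
    for pair in merges:
        if current == canonical:
            break  # reached the canonical tokenization
        candidate = apply_merge(current, pair)
        if candidate is None:
            continue  # merge is not valid for the current segmentation
        if passes(candidate):
            current = candidate  # accept the merge
            accepted.append(candidate)
        # otherwise reject: `current` stays at the pre-merge segmentation
    return accepted


# Toy data: a hypothetical merge list and an oracle that fails one segmentation.
start = ("h", "e", "l", "l", "o")
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
canonical = ("hello",)
toy_oracle = lambda toks: toks != ("he", "ll", "o")

print(grow_passing_retokenizations(start, merges, toy_oracle, canonical))
```

In this toy run the second merge is rejected, which makes the later merges inapplicable, so the walk ends without reaching the canonical tokenization. A real run would replace `toy_oracle` with a model call followed by the unit tests, so each candidate costs one generation.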
## Appendix E Experimental Compute Resources
Most experiments were run on single H100 and A100 GPUs. The bulk of the compute was spent on the retokenization, temperature, and typo sampling procedures. Across all models used in this study, the wall-clock runtime of each sampling process was approximately 6 hours per benchmark dataset (with parameters defined in Appendix [A](https://arxiv.org/html/2606.15521#A1)). The syntactic diversity calculations were run on 32 Intel Xeon Platinum 8559C CPUs for a total wall-clock time of 2 hours. Compute cost calculations were also run on CPU and took less than a minute.
## Appendix F Licenses for existing assets
We cite the original creators and sources for all models and datasets used in our experiments. The licenses for these artifacts are listed below.
Table 2: Licenses for models and datasets used in our experiments.