Incremental BPE Tokenization

arXiv cs.CL Papers

Summary

This paper introduces an incremental algorithm for Byte Pair Encoding (BPE) tokenization that processes each byte in O(log^2 t) time, enabling efficient partial tokenization in streaming settings and achieving speedups over existing implementations.

arXiv:2605.30813v1 Announce Type: new Abstract: We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop-in replacement for standard BPE, our approach achieves a speedup of up to ${\sim}3\times$ over Hugging Face's tokenizers, and demonstrates significant latency reductions over OpenAI's tiktoken on pathological inputs. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst-case guarantees, while providing practical latency benefits in modern large language model pipelines. Code: https://github.com/ModelTC/mtc-inc-bpe
Original Article
View Cached Full Text

Cached at: 06/01/26, 09:29 AM

# Incremental BPE Tokenization
Source: [https://arxiv.org/html/2605.30813](https://arxiv.org/html/2605.30813)
###### Abstract

We propose a novel algorithm for incremental Byte Pair Encoding \(BPE\) tokenization\. The algorithm processes each input byte inworst\-case𝒪​\(log2⁡t\)\\mathcal\{O\}\(\\log^\{2\}t\)time, leading to an overall complexity of𝒪​\(n​log2⁡t\)\\mathcal\{O\}\(n\\log^\{2\}t\), wherennis the input length andttis the maximum token length\. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules\. This enables efficient partial tokenization in streaming settings\. Functioning as a drop\-in replacement for standard BPE, our approach achieves a speedup of up to∼3×\{\\sim\}3\\timesover Hugging Face’stokenizers, and demonstrates significant latency reductions over OpenAI’stiktokenon pathological inputs\. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization\. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst\-case guarantees, while providing practical latency benefits in modern large language model pipelines\. The source code is available at[https://github\.com/ModelTC/mtc\-inc\-bpe](https://github.com/ModelTC/mtc-inc-bpe)\.

Byte Pair Encoding, Tokenization, Incremental Algorithm, LLM Efficiency, Streaming Inference, Worst\-Case Complexity

## 1Introduction

Byte Pair Encoding \(BPE\), originally proposed as a data compression technique byGage \([1994](https://arxiv.org/html/2605.30813#bib.bib2)\), has become the de facto tokenization method for modern large language models \(LLMs\) following its adaptation bySennrichet al\.\([2016](https://arxiv.org/html/2605.30813#bib.bib1)\)\. By iteratively merging frequent adjacent symbol pairs, BPE constructs a compact vocabulary that balances expressiveness with computational efficiency\. Consequently, highly optimized implementations such as Hugging Face’stokenizers\(Hugging Face,[2026](https://arxiv.org/html/2605.30813#bib.bib5)\)and OpenAI’stiktoken\(OpenAI,[2026b](https://arxiv.org/html/2605.30813#bib.bib6)\)have become standard components in modern LLM pipelines\.

[Figure1](https://arxiv.org/html/2605.30813#S1.F1)illustrates a representative pipeline used in modern LLM tokenization, where BPE is applied as a core stage within a broader sequence of preprocessing steps\.

![Refer to caption](https://arxiv.org/html/2605.30813v1/x1.png)

Figure 1:A representative pipeline in modern LLM tokenization\.Raw text is processed by a sequence of model\-specific preprocessing stages before applying BPE tokenization to each resulting chunk\.Our work focuses on making the BPE stage incremental without altering the correctness and the surrounding stages\.However, these tools are predominantly designed as offline processes: they require the entire input sequence \(or a pre\-segmented chunk\) to be fully observed before producing a canonical tokenization\. As a result, tokenization and prefilling are executed sequentially rather than concurrently, which introduces additional latency in streaming and long\-context inference\.

At an algorithmic level, existing implementations typically rely on heap\-based priority queues to select merges over the full input, leading to a formulation that is inherently global and difficult to adapt to byte\-by\-byte updates\. The formalization of BPE byBerglund and van der Merwe \([2023](https://arxiv.org/html/2605.30813#bib.bib3)\)implies a structural property where the tokenizations for every prefix of a text form a prefix tree of tokens\. This property suggests the possibility of immediate feedback and partial processing; however, exploiting it efficiently requires algorithmic support beyond standard implementations\.

In this paper, we present a novel algorithm forincremental BPE tokenizationthat efficiently maintains BPE tokenizations for every prefix in real\-time\. Our algorithm processes each input byte in a worst\-case time complexity of𝒪​\(log2⁡t\)\\mathcal\{O\}\(\\log^\{2\}t\), resulting in an overall time complexity of𝒪​\(n​log2⁡t\)\\mathcal\{O\}\(n\\log^\{2\}t\), wherennis the input length andttis the maximum token length\.

We implement the algorithm in Rust as a drop\-in replacement for core BPE stages\. Benchmark results \([Table1](https://arxiv.org/html/2605.30813#S7.T1)\) show an end\-to\-end speedup of up to∼3×\{\\sim\}3\\timesover Hugging Face’stokenizers\. Furthermore, while existing tools like OpenAI’stiktokenexhibit𝒪​\(n2\)\\mathcal\{O\}\(n^\{2\}\)behavior on certain inputs, our approach maintains a stable𝒪​\(n​log2⁡t\)\\mathcal\{O\}\(n\\log^\{2\}t\)cost, leading to significant latency reductions on pathological cases \([Figure3](https://arxiv.org/html/2605.30813#S6.F3)\)\.

Beyond raw speed and streaming input, we introduce aneager outputmechanism for streaming output, which emits tokens as soon as stable token boundaries are resolved\. This enables tokenization to be fully pipelined with model inference\.

Profiling of current systems \(Appendix[I](https://arxiv.org/html/2605.30813#A9)\) shows that stages other than BPE can become bottlenecks in tokenization pipelines, burdening both streaming input and output\. With strict worst\-case complexity guarantees, such pre\-segmentation can be algorithmically redundant\.

Furthermore, along with incremental updates, the algorithm naturally supports scenarios requiring full\-prefix access, i\.e\., scenarios in which tokenizations of all prefixes are required, such as deriving attention masks from the prefix tree of tokens for “fill\-in\-the\-middle” \(FIM\) tasks without random or heuristic truncation and re\-computation\.

Our results demonstrate that incremental BPE tokenization is both algorithmically feasible and practically beneficial, withstrict worst\-case guaranteesandexact semantic compatibility\.

#### Contributions\.

We make the following contributions:

- •An incremental BPE algorithm is proposed with a strict worst\-case𝒪​\(log2⁡t\)\\mathcal\{O\}\(\\log^\{2\}t\)per\-byte complexity, ensuring exact equivalence with standard BPE alongside rigorous performance guarantees\.
- •The proposed eager output mechanism enables efficient, real\-time streaming and pipelining with model inference\.
- •Our efficient implementation in Rust serves as a drop\-in replacement that delivers a speedup of up to∼3×\{\\sim\}3\\timesover current state\-of\-the\-art tools\.
- •With our method, stages of traditional pipelines that serve solely to circumvent the limits of global BPE can be safely removed without performance concerns\.
- •Our method provides native support for full\-prefix tasks \(e\.g\., FIM, token healing\) by efficiently providing tokenizations of all prefixes, eliminating the need for heuristic truncation or re\-computation\.

## 2Related Work

Byte Pair Encoding\(Sennrichet al\.,[2016](https://arxiv.org/html/2605.30813#bib.bib1)\)has established itself as the de facto standard for modern LLMs, employed by state\-of\-the\-art architectures such as GPT\-5\(OpenAI,[2026a](https://arxiv.org/html/2605.30813#bib.bib10)\)and Qwen\-3\(Qwen Team,[2025a](https://arxiv.org/html/2605.30813#bib.bib18)\)\. Despite alternatives like WordPiece\(Schuster and Nakajima,[2012](https://arxiv.org/html/2605.30813#bib.bib11)\)used in BERT\(Devlinet al\.,[2019](https://arxiv.org/html/2605.30813#bib.bib13)\)and Unigram\(Kudo,[2018](https://arxiv.org/html/2605.30813#bib.bib12)\)adopted by T5\(Raffelet al\.,[2020](https://arxiv.org/html/2605.30813#bib.bib14)\), BPE remains the predominant choice for current models\.

In terms of model architecture, usage patterns have evolved: while earlier models typically applied BPE to the unsegmented input text, modern LLMs increasingly adopt regex\-based pre\-segmentation and other normalization to enforce linguistic boundaries and partition the input\.

Leading libraries implement these strategies differently: Hugging Face’stokenizers\(Hugging Face,[2026](https://arxiv.org/html/2605.30813#bib.bib5)\)provides a general\-purpose implementation using priority queues, whereas OpenAI’stiktoken\(OpenAI,[2026b](https://arxiv.org/html/2605.30813#bib.bib6)\)specializes in the regex\-based approach, relying on segmentation to bound the cost of BPE merges\.

Despite these optimizations, production\-grade tokenization remains predominantlyoffline\. Whether relying on global queues or pre\-segmentation, canonical BPE tokenization is typically performed after a complete segment is available\.

This architecture restricts the ability to hide latency by pipelining tokenization with model inference at a fine\-grained level\. Furthermore, the reliance on regex engines for complexity control can be brittle, potentially introducing performance degradation or stack exhaustion on pathological inputs such as extensive repetitions\.

On the theoretical side,Berglund and van der Merwe \([2023](https://arxiv.org/html/2605.30813#bib.bib3)\)analyze the formal semantics of different BPE implementations and note its consistency under token\-wise truncation\. This observation implies that the tokenization of a string prefix remains stable within the full tokenization, hinting at the possibility of incremental processing\. However, their work focuses on algebraic properties rather than algorithmic construction, and does not address the challenge of bounding the lookahead required for online updates\.

In the engineering domain, practical implementations such as the cratebpeinrust\-gems\(GitHub,[2026](https://arxiv.org/html/2605.30813#bib.bib7)\), whose techniques are discussed byvan Antwerpen and Neubeck \([2024](https://arxiv.org/html/2605.30813#bib.bib8)\), have adopted the Aho–Corasick automaton\(Aho and Corasick,[1975](https://arxiv.org/html/2605.30813#bib.bib4)\)for incremental tokenization\. The detailed comparison is discussed in Appendix[J](https://arxiv.org/html/2605.30813#A10)\. While demonstrating practical utility, these approaches generally lack formal worst\-case complexity guarantees\.

Our work bridges this gap by providing an incremental algorithm with𝒪​\(log2⁡t\)\\mathcal\{O\}\(\\log^\{2\}t\)worst\-case update time while maintaining strict compatibility with standard BPE\.

## 3Structural Foundations of Incremental BPE

In this section, we establish the structural foundations of our algorithm\. We first formalize the incremental tokenization problem by explicitly defining the concept of theLast Tokenand deriving the recursive properties of BPE\. We then introduce a normalized representation of the BPE vocabulary to eliminate redundancy\. Finally, we construct theSuccessor Forestand theSuffix\-Successor Tree, which organize the search space for incremental tokenization\. These structures provide the topological basis for the theoretical properties and search algorithms presented in subsequent sections\.

### 3\.1Preliminaries and Definitions

LetΣ\\Sigmabe a finite alphabet andV⊂Σ\+V\\subset\\Sigma^\{\+\}be a finite vocabulary\. The vocabularyVVconsists ofatomic tokens\(where\|t\|=1\|t\|=1\) andnon\-atomic tokens\(where\|t\|\>1\|t\|\>1\)\.

Asuffix tokenof a stringssis defined as any tokent∈Vt\\in Vthat is a suffix ofss\.

Atoken sequenceφ=\[t1,…,tn\]\\varphi=\[t\_\{1\},\\dots,t\_\{n\}\]maps back to the string viadetokenizationπ​\(φ\)=t1​…​tn\\pi\(\\varphi\)=t\_\{1\}\\dots t\_\{n\}\.

Adictionaryis defined by an ordered list of merge rulesD=\[r1,r2,…,rm\]D=\[r\_\{1\},r\_\{2\},\\dots,r\_\{m\}\]\. Each rulerir\_\{i\}specifies a pair of adjacent tokens\(x,y\)\(x,y\)to be merged into a new tokenz=x​yz=xy\.

Applying a rulerrto a token sequenceφ\\varphi, denoted as𝕋r​\(φ\)\\mathbb\{T\}\_\{r\}\(\\varphi\), replaces occurrences of adjacent tokensxxandyywithzzsequentially from left to right\.

The full BPE tokenization for a stringsswith the dictionaryDD, denoted as𝕋D​\(s\)\\mathbb\{T\}\_\{D\}\(s\), is obtained by initially mappingssto a sequence of charactersφ0\\varphi\_\{0\}, and then applying the rules inDDaccording to their priority:

𝕋D​\(s\)=\(𝕋rm∘⋯∘𝕋r2∘𝕋r1\)​\(φ0\)\\mathbb\{T\}\_\{D\}\(s\)=\(\\mathbb\{T\}\_\{r\_\{m\}\}\\circ\\dots\\circ\\mathbb\{T\}\_\{r\_\{2\}\}\\circ\\mathbb\{T\}\_\{r\_\{1\}\}\)\(\\varphi\_\{0\}\)\(1\)wherer1r\_\{1\}represents the highest priority rule\.

Note that under this definition, the standard BPE differs from the SentencePiece\(Kudo and Richardson,[2018](https://arxiv.org/html/2605.30813#bib.bib9)\)semantics\. Our analysis is formulated under the standard BPE merge semantics\. See Appendix[A](https://arxiv.org/html/2605.30813#A1)for a detailed discussion on the difference between the two tokenization semantics and the method to “properize” merge rules for the standard BPE\.

In the remainder of this paper, we omit the subscriptDDand write𝕋​\(s\)\\mathbb\{T\}\(s\)when the dictionary is clear from context\.

### 3\.2Problem Formulation

#### Prefix Consistency\.

Our incremental approach relies on the consistency of BPE tokenization under token\-wise truncation\. As noted byBerglund and van der Merwe \([2023](https://arxiv.org/html/2605.30813#bib.bib3)\)\(Remark 3\), a valid BPE tokenization𝕋​\(⋅\)\\mathbb\{T\}\(\\cdot\)can be freely truncated: the remaining sequence remains the valid tokenization of its corresponding substring\.

###### Lemma 3\.1\(Prefix Consistency\)\.

Letssbe a string andφ=𝕋​\(s\)\\varphi=\\mathbb\{T\}\(s\)be its token sequence after tokenization\. For any proper prefixμ⊂φ\\mu\\subset\\varphiof the token sequence, the following identity holds:𝕋​\(π​\(μ\)\)=μ\\mathbb\{T\}\\left\(\\pi\\left\(\\mu\\right\)\\right\)=\\mu\.

Intuitively, this lemma holds because each BPE merge step is aboundary eliminationprocess visualized in[Figure4](https://arxiv.org/html/2605.30813#A2.F4)\. We provide an independent formal proof in Appendix[B](https://arxiv.org/html/2605.30813#A2)\.

This lemma ensures that the tokenization of a growing string can be modeled as extending the previous valid tokenization of some shorter prefix rather than re\-computing from scratch\.

#### The Last Token\.

Based on this consistency, we define theLast Token, denoted asθ​\(s\)\\theta\(s\)\. For any non\-empty stringss,θ​\(s\)\\theta\(s\)is the last token in its BPE tokenization sequence:θ​\(s\)=last⁡\(𝕋​\(s\)\)\\theta\(s\)=\\operatorname\{last\}\\left\(\\mathbb\{T\}\\left\(s\\right\)\\right\)\.

According to the Prefix Consistency lemma \(Lemma[3\.1](https://arxiv.org/html/2605.30813#S3.Thmtheorem1)\), since removing the last token leaves a valid tokenization sequence for the remaining string prefix, we can express the tokenization recursively:

𝕋​\(s\)=𝕋​\(spre\)⊕\[θ​\(s\)\]\\mathbb\{T\}\(s\)=\\mathbb\{T\}\\left\(s\_\{\\mathrm\{pre\}\}\\right\)\\oplus\\left\[\\theta\\left\(s\\right\)\\right\]\(2\)wherespres\_\{\\mathrm\{pre\}\}is the stringsswith the suffix string ofθ​\(s\)\\theta\(s\)removed, and⊕\\oplusdenotes sequence concatenation\.

In terms of complexity, the full\-prefix tokenization requires maintainingΘ​\(\|s\|\)\\Theta\(\|s\|\)states\.

#### Prefix Tree of Tokens\.

This recursive relationship implies that the tokenizations of all prefixes naturally form a hierarchical structure\. Strictly speaking, this structure constitutes a forest, as a tokenization sequence may begin with any valid token\. To unify this into a singlePrefix Tree of Tokens, we introduce avirtual rootrepresenting the tokenization of the empty stringε\\varepsilon, which serves as the shared root for all initial tokens\.

While we visualize the space of all prefix tokenizations as this tree, strictly maintaining the explicit structure is unnecessary\. Instead, the tokenization for any prefix can be retrieved solely bybacktrackingusingθ​\(⋅\)\\theta\(\\cdot\)until the beginning of the string \(the virtual root\) is reached\.

#### Incremental Objective\.

Consider an incremental scenario where a characterccis appended to the current stringss, resulting ins′=s​cs^\{\\prime\}=sc\. To update the tokenization, we only need to find the value ofθ​\(s′\)\\theta\(s^\{\\prime\}\)\. The objective is to identifyθ​\(s′\)\\theta\(s^\{\\prime\}\)within theintersectionof the set of all suffixes ofs′s^\{\\prime\}and the vocabularyVV\. In the following sections, we introduce theSuccessor Forestand other structures to efficiently solve this search problem\.

### 3\.3Normalization and Token Hierarchy

In practice, BPE vocabularies often contain unreachable entries\. To facilitate efficient updates, we define anormalized representation\. We say a tokent∈Vt\\in Viscanonicalif𝕋D​\(t\)=\[t\]\\mathbb\{T\}\_\{D\}\(t\)=\[t\]\. Thenormalized vocabulary𝒱\\mathcal\{V\}consists solely of canonical tokens\. Crucially, the determinism of BPE establishes abijective mappingbetween non\-atomic canonical tokens and the specific merge rules that produce them, which we termcanonical rules\. Thenormalized dictionary𝒟\\mathcal\{D\}contains strictly these rules\. We show in Appendix[C](https://arxiv.org/html/2605.30813#A3)that this normalization preserves exact compatibility, i\.e\.,𝕋𝒟​\(s\)≡𝕋D​\(s\)\\mathbb\{T\}\_\{\\mathcal\{D\}\}\(s\)\\equiv\\mathbb\{T\}\_\{D\}\(s\)\.

#### Predecessors and Successors\.

Based on the bijection above, for every non\-atomict∈𝒱t\\in\\mathcal\{V\}produced by the canonical rule\(x,y\)\(x,y\)oftt, we define itspredecessorpre⁡\(t\)=x\\operatorname\{pre\}\(t\)=xandsuccessorsuc⁡\(t\)=y\\operatorname\{suc\}\(t\)=y\. As detailed in Appendix[C](https://arxiv.org/html/2605.30813#A3), the determinism of BPE ensuresx,y∈𝒱x,y\\in\\mathcal\{V\}\. Thus, the vocabulary is closed under these operations, forming the topological basis for the Successor Forest\.

### 3\.4Successor Forest

![Refer to caption](https://arxiv.org/html/2605.30813v1/x2.png)Figure 2:Successor Forest and the merge rules\.Solid edges denote the Successor Forest, while dashed edges indicate predecessors\.Violetnodes highlightSufSucTree⁡\(abcdef\)\\operatorname\{SufSucTree\}\(\\texttt\{abcdef\}\)\. Other colored nodes, formed by the predecessors and some of their subtrees, identify the tokensθ​\(s−suc⁡\(t\)\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)satisfying thePrefix Last\-Token Condition\(Definition[4\.1](https://arxiv.org/html/2605.30813#S4.Thmtheorem1)\) for the tokenstt:ef,def,bcdef, andabcdef\.Based on the hierarchy, we construct the topological structure for searchingθ​\(⋅\)\\theta\(\\cdot\)\. We define theSuccessor Forest\(ℱsuc\\mathcal\{F\}\_\{\\operatorname\{suc\}\}\) as a directed graphG=\(𝒱,E\)G=\(\\mathcal\{V\},E\)where edges point from a token to its successor:E=\{\(u,suc⁡\(u\)\)∣u∈𝒱,\|u\|\>1\}E=\\\{\(u,\\operatorname\{suc\}\(u\)\)\\mid u\\in\\mathcal\{V\},\|u\|\>1\\\}\. Since edges strictly reduce token length \(\|suc⁡\(u\)\|<\|u\|\|\\operatorname\{suc\}\(u\)\|<\|u\|\),GGis acyclic and forms a forest of trees rooted at atomic tokens\.

An example Successor Forest is illustrated in[Figure2](https://arxiv.org/html/2605.30813#S3.F2)\.

### 3\.5Suffix\-Successor Tree

By definition, for any non\-atomic tokenuu,u=pre⁡\(u\)​suc⁡\(u\)u=\\operatorname\{pre\}\(u\)\\operatorname\{suc\}\(u\), which impliessuc⁡\(u\)\\operatorname\{suc\}\(u\)is a proper suffix ofuu\. This property ensures that traversing parents inℱsuc\\mathcal\{F\}\_\{\\operatorname\{suc\}\}corresponds to taking proper suffixes\.

For a given tokent∈𝒱t\\in\\mathcal\{V\}, we define theSuffix\-Successor Tree, denoted asSufSucTree⁡\(t\)\\operatorname\{SufSucTree\}\(t\), as the subgraph ofℱsuc\\mathcal\{F\}\_\{\\operatorname\{suc\}\}induced by the set of all canonical tokens that are suffixes oftt\.

LetStS\_\{t\}be the set of suffix tokens oftt\(note thatt∈Stt\\in S\_\{t\}\)\. If a non\-atomic tokenu∈Stu\\in S\_\{t\}, then its parent in the forest,suc⁡\(u\)\\operatorname\{suc\}\(u\), must also be a suffix oftt, implyingsuc⁡\(u\)∈St\\operatorname\{suc\}\(u\)\\in S\_\{t\}\. Since all suffixes ofttshare the same last character, all paths in this subgraph converge to the unique atomic suffix token oftt\. Therefore,SufSucTree⁡\(t\)\\operatorname\{SufSucTree\}\(t\)forms a single tree rooted at this atomic token\.

This structure organizes all potential candidates for the value ofθ​\(⋅\)\\theta\(\\cdot\)\. By identifying further properties on this tree, we can reformulate the incremental tokenization as an efficient tree search problem\.

An example Suffix\-Successor Tree is highlighted in violet in[Figure2](https://arxiv.org/html/2605.30813#S3.F2)\.

## 4The Monotonic Path Property

This section establishes the key property underlying BPE for suffix tokens\. For any input string, we will show that the suffix tokens capable of serving as the final token form a single monotonic path in the Suffix\-Successor Tree\.

### 4\.1Intuition

Fix a non\-empty stringssand a suffix tokenttofss\. Forttto be the valid last tokenθ​\(s\)\\theta\(s\), the boundaries around its componentspre⁡\(t\)\\operatorname\{pre\}\(t\)andsuc⁡\(t\)\\operatorname\{suc\}\(t\)are required to survive the merge process before applying the canonical rule oftt\(as formalized in Lemma[E\.3](https://arxiv.org/html/2605.30813#A5.Thmtheorem3)and Lemma[E\.6](https://arxiv.org/html/2605.30813#A5.Thmtheorem6)\)\.

This survival establishes a necessary dual constraint on the merge history:

1. 1\.Structural Reachability:pre⁡\(t\)\\operatorname\{pre\}\(t\)must appear as the rightmost token at some stage during the processing of the prefix, meaning the eventual last token of the prefix must be a descendant ofpre⁡\(t\)\\operatorname\{pre\}\(t\)\.
2. 2\.Procedural Priority: The merge rule formingttmust executebeforeany rule that would otherwise consumepre⁡\(t\)\\operatorname\{pre\}\(t\)\(specifically, merging it with a left neighbor\)\.

Crucially, these conditions are necessary but not sufficient, since the newly formed tokenttmay still be involved in subsequent merges before the final tokenization is reached\.

Another intuition is that the rightmost token can be viewed as evolving through the tokenization: starting from an atomic token, it is progressively merged with its adjacent token\. This process can be viewed as traversing a descendant path in the Successor Forest, until reaching the last token of the final tokenization\.

We provide a detailed derivation of this intuition based on the “boundary elimination” logic in Appendix[D](https://arxiv.org/html/2605.30813#A4)\.

### 4\.2Formalization

We now formalize the intuition from the previous subsection\. The following definition establishes a rigorous criterion for determining whether a candidate suffix tokenttis compatible with the tokenization of the remaining prefix\.

###### Definition 4\.1\(Prefix Last\-Token Condition\)\.

Fix a non\-empty stringssand a suffix tokenttofss\.

Ifttis anatomic token, it satisfies thePrefix Last\-Token Conditionby definition\.

Ifttis anon\-atomic token, lets−suc⁡\(t\)s^\{\-\\operatorname\{suc\}\(t\)\}denote the prefix obtained by removing the suffixsuc⁡\(t\)\\operatorname\{suc\}\(t\)fromss\. We say thatttsatisfies the condition with respect tossif the following hold:

1. 1\.Reachability\.The tokenθ​\(s−suc⁡\(t\)\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)lies in the subtree of the Successor Forest rooted atpre⁡\(t\)\\operatorname\{pre\}\(t\)\(including the caseθ​\(s−suc⁡\(t\)\)=pre⁡\(t\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)=\\operatorname\{pre\}\(t\)\)\.
2. 2\.Priority dominance\.Ifθ​\(s−suc⁡\(t\)\)≠pre⁡\(t\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)\\neq\\operatorname\{pre\}\(t\), letuube the unique child ofpre⁡\(t\)\\operatorname\{pre\}\(t\)that lies on the ancestor path fromθ​\(s−suc⁡\(t\)\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)topre⁡\(t\)\\operatorname\{pre\}\(t\)\. Then the canonical rule oftthas strictly higher priority than that ofuu\.

We now assert the main structural theorem of this paper\.

###### Theorem 4\.2\(Monotonic Path Property\)\.

Letkkbe the longest canonical suffix token of a stringss\. Consider the Suffix\-Successor TreeSufSucTree⁡\(k\)\\operatorname\{SufSucTree\}\(k\)\. LetPPdenote the unique path from the last tokenθ​\(s\)\\theta\(s\)to the root\. Then a nodet∈SufSucTree⁡\(k\)t\\in\\operatorname\{SufSucTree\}\(k\)satisfies the Prefix Last\-Token Condition if and only ifttlies on the unique pathPP\.

We provide the complete proof in Appendix[E](https://arxiv.org/html/2605.30813#A5)\. The “if” direction follows directly from the intuition above: along the successor chain of the true last token, both reachability and priority dominance are preserved monotonically\.

It is notable that, although the Monotonic Path Property restricts the search space to the Suffix\-Successor Tree of the longest suffix token,the Prefix Last\-Token Condition itself is evaluated in the global Successor Forest\. In particular, for a candidate tokentt, the condition depends only on the relative position betweenpre⁡\(t\)\\operatorname\{pre\}\(t\)andθ​\(s−suc⁡\(t\)\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)in the Successor Forest, and does not depend on the membership ofpre⁡\(t\)\\operatorname\{pre\}\(t\)inSufSucTree⁡\(t\)\\operatorname\{SufSucTree\}\(t\)\.

### 4\.3Linearization via DFS

To utilize the Monotonic Path Property to find the last tokenθ​\(s\)\\theta\(s\), it requires an efficient method to verify whether a nodet∈SufSucTree⁡\(k\)t\\in\\operatorname\{SufSucTree\}\(k\)satisfies the Prefix Last\-Token Condition \(Definition[4\.1](https://arxiv.org/html/2605.30813#S4.Thmtheorem1)\)\.

For a fixed non\-atomic tokentt, consider the possible values ofθ​\(s−suc⁡\(t\)\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)that satisfy the condition\. This set of valid tokens forms the subtree rooted atpre⁡\(t\)\\operatorname\{pre\}\(t\)in the Successor Forest dictated by the “Reachability” requirement, excluding the subtrees rooted at any direct child ofpre⁡\(t\)\\operatorname\{pre\}\(t\)whose canonical rule has a priority greater than or equal to that oftt, as enforced by the “Priority dominance”\.

Let𝒞t\\mathcal\{C\}\_\{t\}be the set described above for a non\-atomic tokentt\. Ifttis an atomic token,𝒞t\\mathcal\{C\}\_\{t\}is simply defined as an empty set\. In this section, we show that membership in𝒞t\\mathcal\{C\}\_\{t\}can be evaluated in𝒪​\(1\)\\mathcal\{O\}\(1\)time with DFS linearization\.

#### DFS Linearization of the Successor Forest\.

To linearize the forest topology, we perform a pre\-order Depth\-First Search \(DFS\) starting from the atomic tokens\. We maintain a global counter that increments exactly once upon entering each nodeuu, assigning it a unique entry timestampdfs​\_​in⁡\(u\)\\operatorname\{dfs\\\_in\}\(u\)\. We then assign an exit timestampdfs​\_​out⁡\(u\)\\operatorname\{dfs\\\_out\}\(u\)immediately after fully exploring its subtree\.

This process establishes a bijection between the nodes and the timestamps\. Crucially, it guarantees that the subtree rooted at any nodeuucorresponds exactly to a contiguous half\-open interval\[dfs​\_​in⁡\(u\),dfs​\_​out⁡\(u\)\)\[\\operatorname\{dfs\\\_in\}\(u\),\\operatorname\{dfs\\\_out\}\(u\)\)\.

Additionally, when traversing the children of a nodeuu, we visit them in strictly increasing order of their canonical rule priorities \(from lowest to highest priority\)\. This deliberate ordering ensures that children with lower priorities are mapped to earlier timestamps, enabling us to efficiently exclude the subtrees of higher\-priority children to enforce the “Priority dominance” condition\.

#### Valid Interval\.

Notably, the set𝒞t\\mathcal\{C\}\_\{t\}corresponds to a contiguous range of timestamps\. For each non\-atomic canonical tokentt, we precompute a*valid interval*It=\[Lt,Rt\)I\_\{t\}=\[L\_\{t\},R\_\{t\}\)such that a nodex∈𝒞tx\\in\\mathcal\{C\}\_\{t\}if and only ifdfs​\_​in⁡\(x\)∈It\\operatorname\{dfs\\\_in\}\(x\)\\in I\_\{t\}\. Concretely:

- •Lt=dfs​\_​in⁡\(pre⁡\(t\)\)L\_\{t\}=\\operatorname\{dfs\\\_in\}\(\\operatorname\{pre\}\(t\)\);
- •Rt=dfs​\_​in⁡\(u\)R\_\{t\}=\\operatorname\{dfs\\\_in\}\(u\), whereuuis the first child ofpre⁡\(t\)\\operatorname\{pre\}\(t\)\(in the DFS traversal order\) whose canonical rule priority is greater than or equal to that oftt\. If no such child exists, we setRt=dfs​\_​out⁡\(pre⁡\(t\)\)R\_\{t\}=\\operatorname\{dfs\\\_out\}\(\\operatorname\{pre\}\(t\)\)\.

The construction above reduces the evaluation of the Prefix Last\-Token Condition to an𝒪​\(1\)\\mathcal\{O\}\(1\)interval membership test\.

#### Mutual Exclusion among Siblings\.

Finally, it is notable that the valid intervals of sibling nodes in the Suffix\-Successor Tree are necessarily disjoint\. This is a direct corollary of the Monotonic Path Property \([Theorem4\.2](https://arxiv.org/html/2605.30813#S4.Thmtheorem2)\)\. If the intervals of two distinct siblings overlapped, there would exist a prefix state for which both siblings satisfy the Prefix Last\-Token Condition, implying the existence of valid nodes on two diverging branches\. This contradicts the uniqueness of the valid path guaranteed by[Theorem4\.2](https://arxiv.org/html/2605.30813#S4.Thmtheorem2)\.

Thus, for any two distinct childrenuuandvvof a node, we haveIu∩Iv=∅I\_\{u\}\\cap I\_\{v\}=\\emptyset\. This mutual exclusion enables the search algorithm to unambiguously identify the correct branch at each step\.

## 5Incremental Algorithm

Based on the theoretical foundations established in the previous sections, we now present the incremental algorithm for maintaining the last tokenθ​\(s\)\\theta\(s\)\. In this section, we first outline the abstract algorithmic framework and its design objectives\. We then detail the specific data structures, the Aho–Corasick automaton\(Aho and Corasick,[1975](https://arxiv.org/html/2605.30813#bib.bib4)\)and Centroid Decomposition, employed to implement each component efficiently\.

### 5\.1Algorithm Framework

The incremental update problem can be framed as follows: given the current stringss, its last tokenθ​\(s\)\\theta\(s\), and a new charactercc, we aim to compute the new last tokenθ​\(s​c\)\\theta\(sc\)\.

Our strategy relies on the structural properties of the search space\. First, recall thatθ​\(s​c\)\\theta\(sc\)must be asuffix tokenof the strings​csc\. While there may be many suffix tokens, they are all suffixes of thelongest suffix token, denoted asτ​\(s​c\)\\tau\(sc\)\. Consequently, the Suffix\-Successor TreeSufSucTree⁡\(τ​\(s​c\)\)\\operatorname\{SufSucTree\}\(\\tau\(sc\)\)defines oursearch space\.

Within this search space, theMonotonic Path Property\([Theorem4\.2](https://arxiv.org/html/2605.30813#S4.Thmtheorem2)\) guarantees that the set of nodes satisfying the Prefix Last\-Token Condition forms a single continuous path anchored at the root \(the atomic tokencc\)\. Since the atomic token is always a valid candidate, this path is never empty\. The correct last tokenθ​\(s​c\)\\theta\(sc\)corresponds to thedeepest nodeof this valid path\.

Therefore, the algorithm proceeds in two abstract steps:

1. 1\.Search Space Identification: Identify the longest suffix tokenτ​\(s​c\)\\tau\(sc\)to determine the bounding treeSufSucTree⁡\(τ​\(s​c\)\)\\operatorname\{SufSucTree\}\(\\tau\(sc\)\)\.
2. 2\.Path Extension Search: WithinSufSucTree⁡\(τ​\(s​c\)\)\\operatorname\{SufSucTree\}\(\\tau\(sc\)\), search for the deepest node satisfying the Prefix Last\-Token Condition\.

To evaluate the Prefix Last\-Token Condition during the search, the algorithm requires access to the last tokens of previous prefixes\. We abstract this as ahistory interfaceℋ​\(ℓ\)\\mathcal\{H\}\(\\ell\), which returns the last token of the prefix obtained by removingℓ\\ellcharacters from the end of the current string\.

### 5\.2Search Space Maintenance via Aho–Corasick Automaton

To identifyτ​\(s​c\)\\tau\(sc\), we maintain an Aho–Corasick automaton\(Aho and Corasick,[1975](https://arxiv.org/html/2605.30813#bib.bib4)\)on𝒱\\mathcal\{V\}\. We augment the structure by precomputing the search space entry point: for each stateuu, the annotation is the tokenuuitself ifu∈𝒱u\\in\\mathcal\{V\}, otherwise it is inherited fromuu’s suffix link\. This allows retrievingτ​\(s​c\)\\tau\(sc\)in𝒪​\(1\)\\mathcal\{O\}\(1\)without traversing suffix links\. We store the transition table using a persistent square\-root tiling \(details in Appendix[F](https://arxiv.org/html/2605.30813#A6)\) to ensure𝒪​\(1\)\\mathcal\{O\}\(1\)transition queries with memory efficiency\. The annotation at the new state identifies the specificSufSucTree⁡\(⋅\)\\operatorname\{SufSucTree\}\(\\cdot\)for the subsequent search\.

### 5\.3Efficient Search via Centroid Decomposition

The main challenge is to efficiently locate the deepest valid node withinSufSucTree⁡\(τ\)\\operatorname\{SufSucTree\}\(\\tau\), whereτ=τ​\(s​c\)\\tau=\\tau\(sc\)\. Since the tree height can be linear in\|τ\|\|\\tau\|, a naive top\-down traversal is inefficient\. We solve this by guiding the search usingCentroid Decomposition\.

For each canonical tokenτ∈𝒱\\tau\\in\\mathcal\{V\}, we precompute aCentroid Search Tree\(CST\)\. The CST is a tree of height𝒪​\(log⁡\|τ\|\)\\mathcal\{O\}\(\\log\|\\tau\|\), where each node corresponds to a centroiduuof a weak component in the recursive decomposition ofSufSucTree⁡\(τ\)\\operatorname\{SufSucTree\}\(\\tau\)\. Crucially, from the perspective of a centroiduu, the original tree is split into disjoint weakly connected components: one component containing the parent ofuu\(the “upward” direction\), and several components each containing a child ofuu\(the “downward” directions\)\.

The search forθ​\(s​c\)\\theta\(sc\)proceeds by traversing this CST\. At each step, letuube the current centroid\. We evaluate the Prefix Last\-Token Condition foruuby querying the history statek=ℋ​\(\|suc⁡\(u\)\|\)k=\\mathcal\{H\}\(\|\\operatorname\{suc\}\(u\)\|\)and performing the interval checkdfs​\_​in⁡\(k\)∈Iu\\operatorname\{dfs\\\_in\}\(k\)\\in I\_\{u\}\. Based on the Monotonic Path Property, we determine the direction of the valid path’s endpoint:

1. 1\.Case 1:uuis invalid \(dfs​\_​in⁡\(k\)∉Iu\\operatorname\{dfs\\\_in\}\(k\)\\notin I\_\{u\}\)\.Since the valid path is anchored at the tree root \(the atomic token\), an invalid nodeuuimplies that the path does not intersect with the subtree ofSufSucTree⁡\(τ\)\\operatorname\{SufSucTree\}\(\\tau\)rooted atuu; i\.e\., the path must be inside the component of the parent\. We transition to the CST child corresponding to the component containinguu’s parent \(the upward direction\)\.
2. 2\.Case 2:uuis valid \(dfs​\_​in⁡\(k\)∈Iu\\operatorname\{dfs\\\_in\}\(k\)\\in I\_\{u\}\)\.The nodeuulies on the valid path\. The targetθ​\(s​c\)\\theta\(sc\)is eitheruuitself or its descendant\. To determine if the path extends deeper, we check the children ofuuin the originalSufSucTree⁡\(τ\)\\operatorname\{SufSucTree\}\(\\tau\)\. Recall from[Section4\.3](https://arxiv.org/html/2605.30813#S4.SS3)that the valid intervals of siblings are disjoint\. We can thus pre\-sort the intervals and perform a binary search over the children to check if any childvvis valid\. - •If a valid childvvexists \(i\.e\.,dfs​\_​in⁡\(ℋ​\(\|u\|\)\)∈Iv\\operatorname\{dfs\\\_in\}\(\\mathcal\{H\}\(\|u\|\)\)\\in I\_\{v\}\), the path continues intovv’s subtree\. We transition to the CST child corresponding to the component containingvv\. - •If no child is valid, then the path ends atuu\. We concludeθ​\(s​c\)=u\\theta\(sc\)=uand terminate the search\.

### 5\.4Complexity Analysis

The worst\-case time complexity per byte is dominated by the centroid search\.

- •Automaton State Update: The optimized automaton transition takes𝒪​\(1\)\\mathcal\{O\}\(1\)time\.
- •CST Traversal: The CST height is bounded by𝒪​\(log⁡\|τ\|\)\\mathcal\{O\}\(\\log\|\\tau\|\)\.
- •Decision Step: At each CST node, we perform either a single interval check or a binary search over the children\. Since the degree of a node inSufSucTree⁡\(τ\)\\operatorname\{SufSucTree\}\(\\tau\)is bounded by\|τ\|\|\\tau\|, the binary search takes𝒪​\(log⁡\|τ\|\)\\mathcal\{O\}\(\\log\|\\tau\|\)\.

Each check involves a history queryℋ​\(⋅\)\\mathcal\{H\}\(\\cdot\), assumed to be an𝒪​\(1\)\\mathcal\{O\}\(1\)operation\. Combining these factors, the complexity per byte is𝒪​\(log2⁡\|τ\|\)\\mathcal\{O\}\(\\log^\{2\}\|\\tau\|\), leading to an overall complexity of𝒪​\(n​log2⁡t\)\\mathcal\{O\}\(n\\log^\{2\}t\)for the entire input, wherettis the maximum token length\.

## 6Eager Output

The incremental algorithm efficiently maintains the stateθ​\(s\)\\theta\(s\)\. Since the sequence of last tokens implicitly encodes thePrefix Tree of Tokens, the full tokenization is typically recovered bybacktrackingfromθ​\(s\)\\theta\(s\)to the virtual root\.

In a streaming setting, however, incrementality alone does not guarantee streaming output\. Thus, the challenge is to identify theinitial segmentof this backtracking chain that has become invariant, serving as the common ancestry for any future extension of the string\.

### 6\.1The Active Frontier

The ambiguity of the output stream arises from the uncertainty of which current node in the Prefix Tree of Tokens will serve as the parent for future tokens\. When new characters are appended toss, any newly formed last token must attach to the last token of some prefix ofss\.

The range of such possible attachments is bounded by the maximum suffix length queried during the incremental search, which corresponds to the deepest state reached in the Aho–Corasick automaton\. Letd​\(s\)d\(s\)be thedepthof the state reached by the stringssin the Aho–Corasick automaton\. Note that this depth corresponds exactly to the length of the longest suffix ofssrecognized by the automaton\. Since any future token formed must be recognized by the automaton, its length cannot exceedd​\(s\)d\(s\)\.

Consequently, the parent of any future token must be the last token of a prefix ending in the interval\[\|s\|−d​\(s\),\|s\|\]\[\|s\|\-d\(s\),\|s\|\]\.

We thus define the set ofParental Candidates,𝒫\\mathcal\{P\}, as the set of all last tokens that fall within this interval:

𝒫=\{θ​\(s​\[1​…​i\]\)∣\|s\|−d​\(s\)≤i≤\|s\|\}\\mathcal\{P\}=\\\{\\theta\(s\[1\\dots i\]\)\\mid\|s\|\-d\(s\)\\leq i\\leq\|s\|\\\}\(3\)wheres​\[1​…​0\]s\[1\\dots 0\]indicates the empty stringε\\varepsilonandθ​\(ε\)\\theta\(\\varepsilon\)is the virtual root of the Prefix Tree of Tokens\. The stable output is thecommon ancestral pathshared by all tokens in𝒫\\mathcal\{P\}\.

### 6\.2Maintenance Algorithm

To efficiently identify and emit the stable tokens, we exploit the monotonicity of the window boundaries\. Observe that the automaton transition for a new character can increase the state depth by at most 1 \(i\.e\.,d​\(s​c\)≤d​\(s\)\+1d\(sc\)\\leq d\(s\)\+1\)\. Therefore, the start of the window,\|s\|−d​\(s\)\|s\|\-d\(s\), is monotonically non\-decreasing\. This guarantees that candidates only “expire” from𝒫\\mathcal\{P\}as the window slides forward; previously expired tokens never re\-enter\.

Based on this property, we implement a two\-pointer algorithm to maintain the set𝒫\\mathcal\{P\}and track the active subgraph in the Prefix Tree of Tokens\. Tokens are emitted eagerly once all active paths converge to a single child of the virtual root\. The detailed algorithmic steps and complexity analysis are provided in Appendix[G](https://arxiv.org/html/2605.30813#A7)\.

While eager output enables streaming, it introduces computational overhead due to maintaining the dynamic subgraph\. Therefore, unless otherwise specified, the standard benchmarks presented later utilize the non\-eager incremental algorithm\.

![Refer to caption](https://arxiv.org/html/2605.30813v1/x3.png)Figure 3:End\-to\-end throughput under pathological inputs \(log–log scale\)\.Filled markers represent measured results, while hollow markers indicate extrapolated values beyond the measurable range\. The dashedverticalline highlights the input length at which the regular expression matching fails before BPE\. Our approach demonstrates higher and more stable throughput compared to baseline methods\. Notably,tiktokenexhibits a characteristic decay as input length increases, reflecting its underlying𝒪​\(n2\)\\mathcal\{O\}\(n^\{2\}\)complexity\.

## 7Benchmarks

We evaluate the end\-to\-end throughput of our incremental BPE algorithm by integrating it as a drop\-in replacement for Hugging Face’stokenizersand OpenAI’stiktoken\. Since our algorithm is fully compatible with existing pipelines, the benchmarks isolate the effect of replacing the BPE stage alone, without altering segmentation, normalization, or caching behavior\.

Detailed experimental setup, dataset composition, and execution strategies are provided in Appendix[H](https://arxiv.org/html/2605.30813#A8), while thefull benchmark results\(including eager output\) are listed inLABEL:tab:detailed\-benchmarks\.

### 7\.1Throughput Analysis

[Table1](https://arxiv.org/html/2605.30813#S7.T1)reports the speedup factors evaluated on concatenated documents from each dataset\.

#### Absence of Pre\-tokenization\.

The most substantial gains are observed inCodeLlama\(up to3\.13×\\times\), where the BPE processes normalized text without regex\-based pre\-tokenization\. In this regime, the heap\-based implementation intokenizersfaces log\-linear complexity over the full input length, whereas our algorithm leverages its incremental nature to maintain high throughput\.

#### Coarse\-Grained Segmentation\.

For theChinesedataset, regex\-based pre\-tokenization typically yields coarser segments than English due to the specific designs of the regex rules and the characteristics of the language\.

This exposes the performance bottlenecks in the original BPE implementation oftiktoken, while our method achieves significant speedups \(up to 1\.59×\\times\) by strictly bounding the per\-byte processing time\.

#### Fine\-Grained Segmentation\.

For theEnglishdataset, regex rules effectively partition the input into small chunks, where the constant factors of the implementation overhead become dominant\. This explains the slight regression of our method observed in some cases\.

Table 1:End\-to\-end throughput speedup\.We report the speedup factor of our incremental BPE as a drop\-in replacement for Hugging Face’stokenizersand OpenAI’stiktoken\(using default configurations with short\-string caches enabled where applicable\)\.CodeLlamashows the highest gains due to the absence of regex\-based pre\-tokenization\.Tokenizer ModelEnglishChineseCode\\rowcolorgray\!10\\cellcolorwhitetokenizersCodeLlama3\.13×\\times1\.10×\\times2\.88×\\timesQwen\-31\.05×\\times1\.04×\\times1\.08×\\timesDeepSeek\-3\.21\.01×\\times0\.93×\\times1\.03×\\timesOuro1\.00×\\times1\.02×\\times1\.02×\\timesLlama\-3\.1\*0\.99×\\times1\.03×\\times1\.02×\\timesGPT\-OSS1\.00×\\times1\.08×\\times1\.01×\\timesLlama\-41\.01×\\times1\.01×\\times1\.00×\\timesMistral\-31\.00×\\times1\.04×\\times0\.99×\\times\\cellcolorwhitetiktokenP50K0\.97×\\times1\.35×\\times1\.07×\\timesR50K0\.96×\\times1\.35×\\times1\.05×\\timesCL100K0\.96×\\times1\.59×\\times1\.04×\\timesO200K0\.99×\\times1\.46×\\times1\.00×\\times
\*Properized dictionary\. See Appendix[A](https://arxiv.org/html/2605.30813#A1)for discussion on compatibility and non\-properizable cases \(e\.g\., Gemma\-3\)\.

### 7\.2Robustness and Eager Output

#### Pathological Inputs\.

We test robustness using inputs constructed by repeating the character ‘a’2k2^\{k\}times\. As shown in[Figure3](https://arxiv.org/html/2605.30813#S6.F3)and[Figure7](https://arxiv.org/html/2605.30813#A9.F7),tiktokenexhibits a characteristic throughput decay consistent with𝒪​\(n2\)\\mathcal\{O\}\(n^\{2\}\)complexity\.

Our method maintains stable throughput throughout the tests\. The other tested tokenizers within the same frameworks exhibit similar trend characteristics to the representative models shown\. Notably, specifically for the O200K model as an exception, long inputs trigger errors in the regex stage before BPE\.

#### Eager Output Overhead\.

We also evaluate oureager outputimplementation \([Section6](https://arxiv.org/html/2605.30813#S6)\)\. Results in Appendix[H](https://arxiv.org/html/2605.30813#A8)indicate that eager output introduces an overhead on the order of 10% in end\-to\-end throughput compared to the non\-eager incremental implementation\. This overhead stems from the additional bookkeeping required to maintain output states, and can potentially be amortized in systems where tokenization is pipelined with I/O or model inference\.

## 8Future Work

While this work demonstrates that the BPE stage itself is fully compatible with streaming input and output, other components in existing pipelines, notably normalization and regex\-based pre\-tokenization, remain offline\-oriented and impede end\-to\-end streaming execution\.

Our profiling results in Appendix[I](https://arxiv.org/html/2605.30813#A9)indicate that these upstream stages can become computational bottlenecks\.

A critical direction for future work is therefore torevisit the design choices of pre\-tokenizationin light of incremental algorithms with explicit worst\-case guarantees\.

## 9Conclusion

We introduced a novel algorithm for incremental BPE tokenization with a strictworst\-case time complexityof𝒪​\(n​log2⁡t\)\\mathcal\{O\}\(n\\log^\{2\}t\)\. Implemented as a drop\-in replacement, the algorithm delivers substantial end\-to\-end speedups over existing systems and avoids the severe performance degradation on pathological inputs\. Beyond performance, our results show that BPE can be made compatible with streaming input and output, enabling optimizations in modern language model systems\.

Overall, our approach proves that strict algorithmic worst\-case guarantees can translate into practical benefits, paving the way for more robust and responsive LLM pipelines\.

## Acknowledgements

This work was supported by the National Natural Science Foundation of China \(Nos\. 62525601, 62476018\) and the Postdoctoral Fellowship Program of CPSF \(No\. BX20250487\)\.

## Impact Statement

This paper presents work whose goal is to advance the field of machine learning by improving the fundamental algorithmic infrastructure of large language models \(LLMs\)\.

#### Energy Efficiency\.

Tokenization is a ubiquitous preprocessing step executed trillions of times in modern LLM pipelines\. By reducing the algorithmic worst\-case time complexity of BPE to𝒪​\(n​log2⁡t\)\\mathcal\{O\}\(n\\log^\{2\}t\), utilizing incremental updates for streaming input, and enabling eager emission for streaming output, our approach reduces the computational resources and latency required for processing\. This optimization contributes to the broader goal of “Green AI,” potentially lowering the aggregate energy consumption and carbon footprint associated with deploying large\-scale language models\.

#### System Robustness and Reliability\.

Our work specifically addresses performance degradation on pathological inputs, a vulnerability present in some current state\-of\-the\-art tokenizers\. By providing strict worst\-case execution guarantees, our algorithm mitigates the risk of algorithmic complexity attacks \(e\.g\., Denial\-of\-Service attacks triggered by carefully crafted input sequences\), thereby enhancing the reliability and security of production systems\.

We do not foresee any specific negative societal consequences directly stemming from this algorithmic improvement, beyond the general dual\-use nature of efficient computing infrastructure\.

## References

- A\. V\. Aho and M\. J\. Corasick \(1975\)Efficient string matching: an aid to bibliographic search\.Communications of the ACM18\(6\),pp\. 333–340\.External Links:[Document](https://dx.doi.org/10.1145/360825.360855)Cited by:[Appendix F](https://arxiv.org/html/2605.30813#A6.p1.1),[§2](https://arxiv.org/html/2605.30813#S2.p7.1),[§5\.2](https://arxiv.org/html/2605.30813#S5.SS2.p1.10),[§5](https://arxiv.org/html/2605.30813#S5.p1.1)\.
- M\. Berglund and B\. van der Merwe \(2023\)Formalizing BPE tokenization\.Electronic Proceedings in Theoretical Computer Science388,pp\. 16–27\.External Links:[Document](https://dx.doi.org/10.4204/eptcs.388.4)Cited by:[§A\.1](https://arxiv.org/html/2605.30813#A1.SS1.p1.1),[§A\.1](https://arxiv.org/html/2605.30813#A1.SS1.p2.4),[§1](https://arxiv.org/html/2605.30813#S1.p4.1),[§2](https://arxiv.org/html/2605.30813#S2.p6.1),[§3\.2](https://arxiv.org/html/2605.30813#S3.SS2.SSS0.Px1.p1.1)\.
- ByteDance \(2025a\)Ouro tokenizer\.Note:Hugging Face model repositoryAccessed: 2026\-01\-01External Links:[Link](https://huggingface.co/ByteDance/Ouro-2.6B)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.9.8.2)\.
- ByteDance \(2025b\)Scaling latent reasoning via looped language models\.External Links:2510\.25741,[Link](https://arxiv.org/abs/2510.25741)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.9.8.1)\.
- DeepSeek\-AI \(2025a\)DeepSeek\-V3\.2 tokenizer\.Note:Hugging Face model repositoryAccessed: 2026\-01\-01External Links:[Link](https://huggingface.co/deepseek-ai/DeepSeek-V3.2)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.3.2.2)\.
- DeepSeek\-AI \(2025b\)DeepSeek\-V3\.2: pushing the frontier of open large language models\.External Links:2512\.02556,[Link](https://arxiv.org/abs/2512.02556)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.3.2.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\),Minneapolis, Minnesota,pp\. 4171–4186\.External Links:[Document](https://dx.doi.org/10.18653/v1/N19-1423)Cited by:[§2](https://arxiv.org/html/2605.30813#S2.p1.1)\.
- P\. Gage \(1994\)A new algorithm for data compression\.The C Users Journal12\(2\),pp\. 23–38\.External Links:[Link](https://dl.acm.org/doi/10.5555/177910.177914)Cited by:[§1](https://arxiv.org/html/2605.30813#S1.p1.1)\.
- GitHub \(2026\)rust\-gems: a collection of Rust algorithms and data structures\.Note:GitHub repositoryAccessed: 2026\-03\-31External Links:[Link](https://github.com/github/rust-gems)Cited by:[Appendix J](https://arxiv.org/html/2605.30813#A10.p1.1),[§2](https://arxiv.org/html/2605.30813#S2.p7.1)\.
- J\. Gjengset \(2026\)inferno: a Rust port of flamegraph\.Note:GitHub repositoryAccessed: 2026\-01\-01External Links:[Link](https://github.com/jonhoo/inferno)Cited by:[Appendix I](https://arxiv.org/html/2605.30813#A9.p1.1)\.
- Google \(2025a\)Gemma 3 technical report\.External Links:[Link](https://goo.gle/Gemma3Report)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.4.3.1)\.
- Google \(2025b\)Gemma\-3 tokenizer\.Note:Hugging Face model repositoryAccessed: 2026\-01\-01External Links:[Link](https://huggingface.co/google/gemma-3-27b-it)Cited by:[§A\.6](https://arxiv.org/html/2605.30813#A1.SS6.p3.1),[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.4.3.2)\.
- Hugging Face \(2026\)Tokenizers: fast state\-of\-the\-art tokenizers optimized for research and production\.Note:GitHub repositoryAccessed: 2026\-01\-01External Links:[Link](https://github.com/huggingface/tokenizers)Cited by:[§1](https://arxiv.org/html/2605.30813#S1.p1.1),[§2](https://arxiv.org/html/2605.30813#S2.p3.1)\.
- S\. Kanda, K\. Akabe, and Y\. Oda \(2023\)Engineering faster double\-array Aho–Corasick automata\.Software: Practice and Experience53\(6\),pp\. 1332–1361\.External Links:[Document](https://dx.doi.org/10.1002/spe.3190)Cited by:[§J\.3](https://arxiv.org/html/2605.30813#A10.SS3.SSS0.Px1.p1.5)\.
- T\. Kudo and J\. Richardson \(2018\)SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing\.InProceedings of the 2018 conference on empirical methods in natural language processing: System demonstrations,pp\. 66–71\.External Links:[Document](https://dx.doi.org/10.18653/v1/D18-2012)Cited by:[item 2](https://arxiv.org/html/2605.30813#A1.I1.i2.p1.1),[§3\.1](https://arxiv.org/html/2605.30813#S3.SS1.p7.1)\.
- T\. Kudo \(2018\)Subword regularization: improving neural network translation models with multiple subword candidates\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),I\. Gurevych and Y\. Miyao \(Eds\.\),Melbourne, Australia,pp\. 66–75\.External Links:[Document](https://dx.doi.org/10.18653/v1/P18-1007)Cited by:[§2](https://arxiv.org/html/2605.30813#S2.p1.1)\.
- D\. Leijen, B\. Zorn, and L\. de Moura \(2019\)Mimalloc: free list sharding in action\.InProgramming Languages and Systems,A\. W\. Lin \(Ed\.\),pp\. 244–265\.External Links:[Document](https://dx.doi.org/10.1007/978-3-030-34175-6%5F13)Cited by:[§H\.1](https://arxiv.org/html/2605.30813#A8.SS1.p2.1)\.
- Meta AI \(2023\)Code Llama: open foundation models for code\.External Links:2308\.12950,[Link](https://arxiv.org/abs/2308.12950)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.2.1.1)\.
- Meta AI \(2024\)The Llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.6.5.1)\.
- Meta AI \(2025\)The Llama 4 herd: the beginning of a new era of natively multimodal ai innovation\.External Links:[Link](https://ai.meta.com/blog/llama-4-multimodal-intelligence/)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.7.6.1)\.
- Meta Llama \(2023\)Code Llama tokenizer\.Note:Hugging Face model repositoryAccessed: 2026\-01\-01External Links:[Link](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.2.1.2)\.
- Meta Llama \(2024\)Llama\-3\.1 tokenizer\.Note:ModelScope model repositoryAccessed: 2026\-01\-01External Links:[Link](https://modelscope.cn/models/LLM-Research/Meta-Llama-3.1-8B)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.6.5.2)\.
- Meta Llama \(2025\)Llama\-4 tokenizer\.Note:ModelScope model repositoryAccessed: 2026\-01\-01External Links:[Link](https://modelscope.cn/models/LLM-Research/Llama-4-Maverick-17B-128E)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.7.6.2)\.
- Mistral AI \(2025a\)Mistral Large 3: a state\-of\-the\-art open model\.External Links:[Link](https://mistral.ai/news/mistral-3/)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.8.7.1)\.
- Mistral AI \(2025b\)Mistral\-Large\-3 tokenizer\.Note:Hugging Face model repositoryAccessed: 2026\-01\-01External Links:[Link](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.8.7.2)\.
- OpenAI \(2025a\)gpt\-oss\-120b & gpt\-oss\-20b model card\.External Links:2508\.10925,[Link](https://arxiv.org/abs/2508.10925)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.5.4.1)\.
- OpenAI \(2025b\)gpt\-oss\-120b tokenizer\.Note:Hugging Face model repositoryAccessed: 2026\-01\-01External Links:[Link](https://huggingface.co/openai/gpt-oss-120b)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.5.4.2)\.
- OpenAI \(2026a\)OpenAI GPT\-5 system card\.External Links:2601\.03267,[Link](https://arxiv.org/abs/2601.03267)Cited by:[§2](https://arxiv.org/html/2605.30813#S2.p1.1)\.
- OpenAI \(2026b\)tiktoken: a fast BPE tokenizer for use with OpenAI’s models\.Note:GitHub repositoryAccessed: 2026\-01\-01External Links:[Link](https://github.com/openai/tiktoken)Cited by:[Appendix J](https://arxiv.org/html/2605.30813#A10.p4.1),[§1](https://arxiv.org/html/2605.30813#S1.p1.1),[§2](https://arxiv.org/html/2605.30813#S2.p3.1)\.
- Purplecoin \(2026\)mimalloc\_rust: a Rust wrapper over microsoft’s mimalloc memory allocator\.Note:GitHub repositoryAccessed: 2026\-01\-01External Links:[Link](https://github.com/purpleprotocol/mimalloc_rust)Cited by:[§H\.1](https://arxiv.org/html/2605.30813#A8.SS1.p2.1)\.
- Qwen Team \(2025a\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.10.9.1),[§2](https://arxiv.org/html/2605.30813#S2.p1.1)\.
- Qwen Team \(2025b\)Qwen3 tokenizer\.Note:Hugging Face model repositoryAccessed: 2026\-01\-01External Links:[Link](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking)Cited by:[Table 3](https://arxiv.org/html/2605.30813#A8.T3.5.10.9.2)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.J\. Mach\. Learn\. Res\.21\(1\)\.External Links:[Link](http://jmlr.org/papers/v21/20-074.html)Cited by:[§2](https://arxiv.org/html/2605.30813#S2.p1.1)\.
- M\. Schuster and K\. Nakajima \(2012\)Japanese and korean voice search\.In2012 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 5149–5152\.External Links:[Document](https://dx.doi.org/10.1109/ICASSP.2012.6289079)Cited by:[§2](https://arxiv.org/html/2605.30813#S2.p1.1)\.
- R\. Sennrich, B\. Haddow, and A\. Birch \(2016\)Neural machine translation of rare words with subword units\.InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),K\. Erk and N\. A\. Smith \(Eds\.\),Berlin, Germany,pp\. 1715–1725\.External Links:[Document](https://dx.doi.org/10.18653/v1/P16-1162)Cited by:[§1](https://arxiv.org/html/2605.30813#S1.p1.1),[§2](https://arxiv.org/html/2605.30813#S2.p1.1)\.
- Together Computer \(2023a\)RedPajama\-Data\-1T dataset\.Note:Hugging Face dataset repositoryAccessed: 2026\-01\-01External Links:[Link](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)Cited by:[2nd item](https://arxiv.org/html/2605.30813#A8.I2.i2.p1.1),[Table 2](https://arxiv.org/html/2605.30813#A8.T2.5.4.3.2)\.
- Together Computer \(2023b\)RedPajama: an open source recipe to reproduce LLaMA training dataset\.Note:GitHub repositoryAccessed: 2026\-01\-01External Links:[Link](https://github.com/togethercomputer/RedPajama-Data)Cited by:[Table 2](https://arxiv.org/html/2605.30813#A8.T2.5.4.3.1)\.
- H\. van Antwerpen and A\. Neubeck \(2024\)So many tokens, so little time: introducing a faster, more flexible byte\-pair tokenizer\.Note:The GitHub BlogExternal Links:[Link](https://github.blog/ai-and-ml/llms/so-many-tokens-so-little-time-introducing-a-faster-more-flexible-byte-pair-tokenizer/)Cited by:[§2](https://arxiv.org/html/2605.30813#S2.p7.1)\.
- Wikimedia Foundation \(2023a\)Wikimedia downloads\.External Links:[Link](https://dumps.wikimedia.org/)Cited by:[Table 2](https://arxiv.org/html/2605.30813#A8.T2.5.2.1.1),[Table 2](https://arxiv.org/html/2605.30813#A8.T2.5.3.2.1)\.
- Wikimedia Foundation \(2023b\)Wikipedia dataset\.Note:Hugging Face dataset repositoryAccessed: 2026\-01\-01External Links:[Link](https://huggingface.co/datasets/wikimedia/wikipedia)Cited by:[Table 2](https://arxiv.org/html/2605.30813#A8.T2.5.2.1.2),[Table 2](https://arxiv.org/html/2605.30813#A8.T2.5.3.2.2)\.

## Appendix AProperizing

### A\.1Semantics Discrepancy

As discussed byBerglund and van der Merwe \([2023](https://arxiv.org/html/2605.30813#bib.bib3)\), Byte Pair Encoding \(BPE\) tokenization generally follows one of two distinct operational semantics:

1. 1\.Standard BPE Semantics\(as defined in[Section3\.1](https://arxiv.org/html/2605.30813#S3.SS1)\): This approach operates on afixed priority schedule\. It iterates through the merge rules from highest to lowest priority, applying each rule exhaustively across the sequence\. Crucially, the process is linear with respect to the rule list: once a priority level is passed, the algorithm never revisits it, regardless of whether subsequent merges create new opportunities for higher\-priority rules\.
2. 2\.SentencePiece Semantics\(Kudo and Richardson,[2018](https://arxiv.org/html/2605.30813#bib.bib9)\): This approach operates on adynamic priority queue\. At each step, it identifies the adjacent token pair with theglobal maximum priorityin the current sequence\. Notably, a merge operation generates at most two new adjacent pairs that may possess even higher priorities\. SentencePiece will immediately execute either of the merges, effectively “growing” the token recursively and prioritizing these local expansions over the previous lower\-priority global candidates\.

To formalize this inconsistency, we adopt the terminology established byBerglund and van der Merwe \([2023](https://arxiv.org/html/2605.30813#bib.bib3)\)\. Animproper dictionaryis defined as one containing merge rules that violate the sequential assumption of the standard BPE\. Specifically, if a rulerlr\_\{l\}produces a token that creates the context for a strictly higher\-priority rulerhr\_\{h\}, the standard BPE will not executerhr\_\{h\}, since the algorithm has strictly proceeded past its priority level\. In contrast, SentencePiece would immediately identifyrhr\_\{h\}as the new global maximum\.

To bridge this inconsistency, we propose a method toproperizethe dictionary: reassigning rule priorities via topological sorting so that the standard BPE’s execution yields results identical to those of SentencePiece\. The method is a sufficient construction for properization by reordering the priorities; other constructions may exist to achieve the same goal\.

### A\.2Dictionary Filtering

In practice, BPE dictionaries often contain redundant or unreachable rules that are never triggered, even under SentencePiece semantics\. To ensure the formalization is well\-defined, we restrict our analysis to valid rules prior to properization\.

Given the deterministic nature of BPE, we define a merge rule\(x,y\)→z\(x,y\)\\to zasvalidif and only if the tokenization of the concatenated stringx​yxyunder SentencePiece semantics results in the single tokenzzitself, and the final applied merge is indeed\(x,y\)\(x,y\)\. Rules failing this condition are effectively unreachable in any context and are thus discarded\.

Crucially, this implies that the mapping from a produced token to its generating rule is unique; no two distinct valid rules result in the same token after filtering\.

In the remainder of this section, we assume the dictionary has been filtered to contain solely valid rules\. Based on this filtered dictionary, we can now define the dependency structure\.

### A\.3Growing Tree

The SentencePiece semantics can be conceptualized as the standard BPE mechanism equipped with additionalgrowing steps\. Under this view, a merge operation\(x0,y0\)→z0\(x\_\{0\},y\_\{0\}\)\\to z\_\{0\}serves as aseed eventthat may trigger a cascade of immediate high\-priority merges due to the greedy strategy\.

Specifically, after the seed merge\(x0,y0\)\(x\_\{0\},y\_\{0\}\), the new tokenz0z\_\{0\}immediately forms new pairs with its left neighbor\(u1,z0\)\(u\_\{1\},z\_\{0\}\)and right neighbor\(z0,v1\)\(z\_\{0\},v\_\{1\}\)\. If either of these new pairs has a strictly higher priority than the original seed\(x0,y0\)\(x\_\{0\},y\_\{0\}\), SentencePiece will immediately execute the one with the higher priority\. Note that executing one side implicitly invalidates the other, as the central tokenz0z\_\{0\}is consumed\.

This recursive process, denoted as\(xk,yk\)→zk\(x\_\{k\},y\_\{k\}\)\\to z\_\{k\}, repeats until no adjacent pair has a higher priority than the original seed\(x0,y0\)\(x\_\{0\},y\_\{0\}\)\. The entire sequence behaves like “growing” outward from the initial seed merge\. Note that, since the length of the merged tokenzkz\_\{k\}strictly increases with each step, no subsequent adjacent pair will possess the exact same priority as the seed\.

Consider all possible scenarios of this process\. The growing path from a seed extends either to the left or to the right, forming a chain of growing steps\. We formalize the collection of all such possible growing steps as theGrowing Tree\. Specifically, the root is the seed rule, and the child nodes are all valid merge rules that can be triggered by the parent during the growing of the root, distinguished by their growing direction \(left or right\)\.

To systematically construct these trees, we analyze the derivation of each valid rule to determine its role \(root or child\)\. For any stringss, letL​\(s\)L\(s\)denote the lowest priority among all merge rules involved in the tokenization ofssunder SentencePiece semantics\. For a valid merge ruler:\(x,y\)→zr:\(x,y\)\\to z, it inherently holds thatL​\(z\)≤min⁡\(L​\(x\),L​\(y\)\)L\(z\)\\leq\\min\(L\(x\),L\(y\)\)\. We classify the rulerrbased on this relationship:

1. 1\.Root Rule\(L​\(z\)<min⁡\(L​\(x\),L​\(y\)\)L\(z\)<\\min\(L\(x\),L\(y\)\)\): The rulerrhas a lower priority than any rule contained within its components\. Thus,rris a seed event that initiates a growing process\. In the Growing Tree,rris a root node\. Functionally, we treat the root asleft\-growing, since the rule becomes applicable only after the tokenyyis generated\.
2. 2\.Growing Step\(L​\(z\)=min⁡\(L​\(x\),L​\(y\)\)L\(z\)=\\min\(L\(x\),L\(y\)\)\): The rulerris triggered by a lower\-priority bottleneck within its components\. We determine the growing direction as follows: - •Left\-Growing: IfL​\(z\)=L​\(y\)L\(z\)=L\(y\), the seed lies withinyy\. Rulerrextends the growing chain to the left\.Crucially, this case covers the scenario whereL​\(x\)=L​\(y\)L\(x\)=L\(y\)\. Under SentencePiece semantics, such ties must be resolved as left\-growing; otherwise, the necessary right\-side tokenyywould not yet be established to participate in the merge\. In the tree,rris a child of the corresponding rule inyy\. - •Right\-Growing: IfL​\(z\)=L​\(x\)L\(z\)=L\(x\)andL​\(z\)<L​\(y\)L\(z\)<L\(y\), the seed lies withinxx\. Rulerrextends the growing chain to the right\. In the tree,rris a child of the corresponding rule inxx\.

Since the dictionary is filtered to ensure unique derivations, each merge rule corresponds to at most one parent, ensuring the dependency structure forms a valid forest\.

Consequently, the growing process of a seed merge under SentencePiece semantics is equivalent to a greedy search for the deepest applicable path within the Growing Tree rooted at that seed\.

### A\.4Dependency Graph and Topological Sort

To simulate SentencePiece semantics using the standard BPE, we need to flatten each Growing Tree into a valid sequential order, by reordering the priorities of rules during growing steps\.

However, the growing steps are applied independently for each seed under SentencePiece semantics, while using the standard BPE requires each rule to be applied to the sequence exhaustively\. We model the relative order and independence of growing steps as a dependency graph\.

1. 1\.Tree Edges \(Parent→\\toChild\): Within each Growing Tree, a parent rule must be processed before its children\. Although the children originally had higher priorities, in the properized dictionary, we must place themafterthe parent to ensure the parent exists to trigger them\.
2. 2\.Sibling Edges \(Among the Direct Children\): For siblings, during the growing steps, only the highest\-prioritized applicable merge is selected\. This implies that sibling rules are evaluated strictly in the order of their original priorities\. Thus, we add edges from the siblings with higher priority to those with lower priority\.
3. 3\.Conflict Edges \(Right\-Growing→\\toLeft\-Growing\): A critical inconsistency arises when two growing steps from different trees compete for the same overlapping token\. Suppose aleft\-growingrule\(w,x\)\(w,x\)and aright\-growingrule\(y,w\)\(y,w\)both require tokenww\. - •Under SentencePiece semantics, each growing process runs independently, so the leftmost seed can obtainwwbefore others\. - •To emulate this deterministically without lookahead, we rely on the left\-to\-right scan nature of the standard BPE\. - •This imposes the constraint thatright\-growing nodes must precede left\-growing nodes\. Hence, edges from the right\-growing node to the left\-growing node obtaining the same token are required\.

One can optimize the number of edges via transitive reduction\. This simplifies the graph without altering its topological properties\.

If the constructed dependency graph is a directed acyclic graph \(DAG\), a topological sort exists\. Deriving the new rule priorities from this topological order yields aproperized dictionary\. The standard BPE, executing this dictionary linearly, will respect all dependencies of the growing steps and produce output identical to SentencePiece\.

### A\.5Sufficiency

###### Theorem A\.1\(Sufficiency\)\.

For any acyclic dependency graph, the derived properized dictionary yields tokenization results under the standard BPE that are identical to those under SentencePiece semantics\.

###### Proof\.

The proof establishes that a topological order exists to reconcile the execution discrepancy between the standard BPE and SentencePiece semantics\.

- •Execution Discrepancy: The standard BPE applies each merge rule globally and exhaustively across the entire sequence\. In contrast, SentencePiece processes seeds sequentially from left to right, completing all recursive growing steps for the current seed before considering the next\.
- •Intra\-tree Consistency: In the absence of conflicts between seeds with the same priority, theTree EdgesandSibling Edgesare sufficient to simulate SentencePiece’s behavior\. These edges ensure that the traversal of the deepest applicable path within each growing tree is executed in the correct dependency order\.
- •Inter\-tree Conflict Resolution: When growing steps from different seeds of the same priority compete for overlapping tokens, SentencePiece inherently prioritizes the leftmost seed\. To simulate this without lookahead, theConflict Edgesensure that right\-growing steps precede left\-growing steps, effectively granting priority to the leftmost seeds within the global scan\.

If the resulting dependency graph is a DAG, the topological sort provides a static priority order that ensures the standard BPE’s global execution results in identical outputs\.

SinceTree EdgesandSibling Edgesdefine a directed forest, they are inherently acyclic\. A cycle would only indicate a conflict between growing processes that cannot be resolved via static priority reordering\. ∎

### A\.6Non\-properizable scenarios

If the graph contains a cycle, the dictionary isnon\-properizable\.

A canonical example is the dictionary\[\(aa, a\), \(a, a\)\]applied to the string “aaaa”\.

- •SentencePiece: yields\[aaa, a\]\.
- •The standard BPE: yields\[aa, aa\], no matter the priorities of the merge rules\.

In the dependency graph,\(a, a\)acts as a root \(left\-growing\) pointing to a right\-growing node\(aa, a\)\. However, theConflict Edgeswould cause a dependency cycle, so it fails to properize\.

In our experiments, one non\-properizable tokenizer we found is Gemma\-3\(Google,[2025b](https://arxiv.org/html/2605.30813#bib.bib27)\), which contains improper merge rules of multiple newline characters\.

## Appendix BProof of the Prefix Consistency

In this section, we provide the formal proof for Lemma[3\.1](https://arxiv.org/html/2605.30813#S3.Thmtheorem1)\. Our proof relies on the inductive properties of the single merge operation visualized in[Figure4](https://arxiv.org/html/2605.30813#A2.F4)\.

#### Notation\.

Let𝕋\(u,v\)\\mathbb\{T\}\_\{\(u,v\)\}denote the operator for a single BPE merge step that replaces all adjacent occurrences of tokens\(u,v\)\(u,v\)from left to right with the merged tokenww\. Initially, the input stringssis mapped to a sequence of atomic tokensφ0\\varphi\_\{0\}\. The full BPE process is a composition of such operators:𝕋=𝕋rm∘⋯∘𝕋r1\\mathbb\{T\}=\\mathbb\{T\}\_\{r\_\{m\}\}\\circ\\dots\\circ\\mathbb\{T\}\_\{r\_\{1\}\}, wherer1,…,rmr\_\{1\},\\dots,r\_\{m\}are the merge rules applied in order\.

###### Proof\.

We prove the lemma by induction on the number of merge steps\.

Base Case \(0 merges\): The initial token sequenceφ0\\varphi\_\{0\}consists of atomic tokens \(bytes/characters\)\. The boundaries are fixed at every character position\. Trivially, any prefix of the atomic token sequence corresponds to the prefix of the raw text\.

Inductive Step: Assume the consistency property holds for the stateφk\\varphi\_\{k\}afterkkmerges\. Consider the\(k\+1\)\(k\+1\)\-th merge ruler=\(u,v\)→wr=\(u,v\)\\to w\. Let the state transition beφk\+1=𝕋r​\(φk\)\\varphi\_\{k\+1\}=\\mathbb\{T\}\_\{r\}\(\\varphi\_\{k\}\)\. Letμ\\mube any prefix of the new token sequenceφk\+1\\varphi\_\{k\+1\}\.

We examine theboundaryat the end ofμ\\mu\. Since𝕋r\\mathbb\{T\}\_\{r\}is aboundary eliminationoperation, it only removes the internal boundaries between instances ofuuandvvsequentially from left to right\. It does not shift or create boundaries\. Therefore, the boundary endingμ\\mumust have existed in the previous stateφk\\varphi\_\{k\}\. This implies there exists a unique prefixη\\etainφk\\varphi\_\{k\}such thatπ​\(η\)=π​\(μ\)\\pi\(\\eta\)=\\pi\(\\mu\)\.

Since𝕋r​\(η\)=μ\\mathbb\{T\}\_\{r\}\(\\eta\)=\\mu\(as the merge operation is local and respects the boundary\), andη\\etais a valid tokenization by hypothesis, it follows thatμ\\muis the valid tokenization ofπ​\(μ\)\\pi\(\\mu\)after the\(k\+1\)\(k\+1\)\-th step\. ∎

![Refer to caption](https://arxiv.org/html/2605.30813v1/x4.png)Figure 4:Inductive step of the Prefix Consistency\.Solid/dashed arrows mark surviving/eliminated boundaries during a merge\(u,v\)\(u,v\)\. For any prefixμ\\muin the post\-merge token sequence, there exists exactly one valid prefixη\\etain the pre\-merge token sequence withπ​\(η\)=π​\(μ\)\\pi\(\\eta\)=\\pi\(\\mu\)\. Notably, applying the single merge step toη\\etayields exactlyμ=𝕋\(u,v\)​\(η\)\\mu=\\mathbb\{T\}\_\{\(u,v\)\}\(\\eta\)\. This proves the lemma by induction over the full tokenization process\.

## Appendix CNormalization and Equivalence Details

In this section, we elaborate on the properties of the normalized representation introduced in[Section3\.3](https://arxiv.org/html/2605.30813#S3.SS3)\.

#### Canonical Generation and Closure\.

For a canonical tokenttproduced eventually by the rule\(x,y\)→t\(x,y\)\\to t, the tokenization of the stringttmust conclude with this specific merge\. This necessitates that the substringsxxandyymust have been independently tokenized intoxxandyyprior to this final merge\. Therefore,𝕋​\(x\)=\[x\]\\mathbb\{T\}\(x\)=\[x\]and𝕋​\(y\)=\[y\]\\mathbb\{T\}\(y\)=\[y\], implying that the predecessorpre⁡\(t\)=x\\operatorname\{pre\}\(t\)=xand successorsuc⁡\(t\)=y\\operatorname\{suc\}\(t\)=yare themselves canonical tokens\. This ensures that𝒱\\mathcal\{V\}is closed under decomposition, which is essential for the construction of the Successor Forest\.

#### Equivalence\.

Normalization does not alter tokenization results for general text\. Since non\-canonical rules are by definition never triggered in the deterministic BPE tokenization of valid tokens, removing them from the dictionary does not affect the merge process for any input string\. Thus,𝕋𝒟​\(s\)≡𝕋D​\(s\)\\mathbb\{T\}\_\{\\mathcal\{D\}\}\(s\)\\equiv\\mathbb\{T\}\_\{D\}\(s\)\.

## Appendix DExtended Intuition of Monotonic Path: Boundary Elimination

In this section, we expand on the intuition provided in[Section4\.1](https://arxiv.org/html/2605.30813#S4.SS1)by reasoning about the merge process of the rightmost token\.

Fix a non\-empty stringssand a suffix tokenttofss\. Lets−suc⁡\(t\)s^\{\-\\operatorname\{suc\}\(t\)\}denote the shorter prefix obtained by removing the suffixsuc⁡\(t\)\\operatorname\{suc\}\(t\)fromss\. Ifttis ever produced as the last token during the merge process, then immediately before applying the canonical rule oftt, the tokenspre⁡\(t\)\\operatorname\{pre\}\(t\)andsuc⁡\(t\)\\operatorname\{suc\}\(t\)must appear consecutively at the end of the current token sequence\.

Crucially, at this specific moment, the boundary betweenpre⁡\(t\)\\operatorname\{pre\}\(t\)andsuc⁡\(t\)\\operatorname\{suc\}\(t\)remains intact, so the presence of the rightmost tokensuc⁡\(t\)\\operatorname\{suc\}\(t\)does not affect any merges that have already occurred to the left ofpre⁡\(t\)\\operatorname\{pre\}\(t\)\. Therefore, the token sequence obtained by removing the finalsuc⁡\(t\)\\operatorname\{suc\}\(t\)coincides with the token sequence produced by the merge process ofs−suc⁡\(t\)s^\{\-\\operatorname\{suc\}\(t\)\}right before applying the rule oftt\.

This implies a necessary condition: ifttis to be formed,pre⁡\(t\)\\operatorname\{pre\}\(t\)must appear as the rightmost token at some stage during the merge process ofs−suc⁡\(t\)s^\{\-\\operatorname\{suc\}\(t\)\}\. In terms of the Successor Forest structure, this means that the eventual last tokenθ​\(s−suc⁡\(t\)\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)must be a descendant ofpre⁡\(t\)\\operatorname\{pre\}\(t\)\(orpre⁡\(t\)\\operatorname\{pre\}\(t\)itself\)\.

Additionally, whether this state is reachable depends on rule priorities\. Ifpre⁡\(t\)\\operatorname\{pre\}\(t\)is not yet the final resultθ​\(s−suc⁡\(t\)\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\), it is destined to be merged with a left neighbor to form a larger token \(specifically, one of its children in the Successor Forest, formed by merging it with its left neighbor\)\. Forttto exist, the canonical rule ofttmust be appliedbeforethis specific merge event occurs\.

In summary, the formation ofttis determined by a single critical moment: whetherpre⁡\(t\)\\operatorname\{pre\}\(t\)can persist as the rightmost token ofs−suc⁡\(t\)s^\{\-\\operatorname\{suc\}\(t\)\}until the canonical rule ofttis applied\.

## Appendix EProof of Monotonic Path Property

In this section, we prove Theorem[4\.2](https://arxiv.org/html/2605.30813#S4.Thmtheorem2), which characterizes the structure of suffix tokens that satisfy the Prefix Last\-Token Condition \(Definition[4\.1](https://arxiv.org/html/2605.30813#S4.Thmtheorem1)\)\.

Recall that for a non\-empty stringss, letkkbe the longest canonical suffix token ofss, and letSufSucTree⁡\(k\)\\operatorname\{SufSucTree\}\(k\)denote the Suffix\-Successor Tree induced by all canonical suffix tokens ofkk\. The theorem asserts that the set of tokens inSufSucTree⁡\(k\)\\operatorname\{SufSucTree\}\(k\)satisfying the Prefix Last\-Token Condition forms exactly the unique path from the true last tokenθ​\(s\)\\theta\(s\)to the root ofSufSucTree⁡\(k\)\\operatorname\{SufSucTree\}\(k\)\.

### E\.1Proof Overview

As discussed in[Section4\.1](https://arxiv.org/html/2605.30813#S4.SS1)and Appendix[D](https://arxiv.org/html/2605.30813#A4), the Prefix Last\-Token Condition \(Definition[4\.1](https://arxiv.org/html/2605.30813#S4.Thmtheorem1)\) can be intuitively understood as the ability of a nodewwto“steal”its predecessorpre⁡\(w\)\\operatorname\{pre\}\(w\)from the last tokenθ​\(s−suc⁡\(w\)\)\\theta\(s^\{\-\\operatorname\{suc\}\(w\)\}\)of the truncated string\.

Claim 1 establishes anupward\-closureproperty: if a node has this stealing capability, then its successor must also have it\. Otherwise, the canonicity of the involved tokens would be violated\.

Claim 2 addresses the question of uniqueness: whether multiple children of the same successor could simultaneously satisfy the condition\. The determinism of the standard BPE merge process, together with strict priority ordering, rules out this possibility\.

Claim 3 confirms that the true last tokenθ​\(s\)\\theta\(s\)indeed satisfies the condition\. Intuitively, sinceθ​\(s\)\\theta\(s\)is produced by the final merge, it necessarily has the ability to stealpre⁡\(θ​\(s\)\)\\operatorname\{pre\}\(\\theta\(s\)\)\.

Finally, Claim 4 shows that the answer nodeθ​\(s\)\\theta\(s\)cannot be further extended: if it could steal once more, it would not be the last token of the final tokenization of the string\.

With these claims, the Monotonic Path Property follows\.

### E\.2Detailed Proof

We begin by introducing auxiliary notation and several structural lemmas that will be used throughout the proof\.

#### Ancestor Path\.

For any canonical tokenx∈𝒱x\\in\\mathcal\{V\}, we defineP​\(x\)P\(x\)to be the unique path fromxxto the root of the Successor Forestℱsuc\\mathcal\{F\}\_\{\\operatorname\{suc\}\}, following successor edges\. Equivalently,P​\(x\)P\(x\)consists ofxxand all of its ancestors in the Successor Forest\.

Since the Successor Forest is acyclic and every non\-atomic token has a unique successor,P​\(x\)P\(x\)is well\-defined and totally ordered by ancestry\.

#### Rule Priority Convention\.

Throughout the proof, we compare merge rules by their execution order in the dictionary: a rulerir\_\{i\}has higher priority thanrjr\_\{j\}\(i\.e\.,ri\>rjr\_\{i\}\>r\_\{j\}\) if and only ifrir\_\{i\}is applied earlier in the standard BPE merge process\.

###### Lemma E\.1\(Priority Monotonicity Along Successor Paths\)\.

Letx,y∈𝒱x,y\\in\\mathcal\{V\}be canonical tokens such thatyyis a descendant ofxxin the Successor Forest \(i\.e\.,x∈P​\(y\)x\\in P\(y\)\)\. If the tokenxxis non\-atomic, then the canonical merge rule producing the tokenxxhas strictly higher priority than the canonical merge rule producing the tokenyy\.

###### Proof\.

By definition of the Successor Forest,yyis obtained by a sequence of canonical merges starting fromxx\. If the canonical rule producingyywere applied before the canonical rule producingxx, thenxxwould never appear as an isolated token during the tokenization of its own string, contradicting the assumption thatxxis canonical\. ∎

###### Lemma E\.2\(Shared Right Boundary\)\.

Letx,y∈𝒱x,y\\in\\mathcal\{V\}be canonical tokens such thatx∈P​\(y\)x\\in P\(y\)\. Consider any tokenization process in which the tokenyyappears as a token in the token sequence\. Immediately before the canonical rule producing the tokenyyis applied, the tokenxxmust also appear as a token, and the right boundary ofxxcoincides with the right boundary ofyy\.

###### Proof\.

Sinceyyis produced by successive canonical merges starting fromxx, all intermediate merges preserve the rightmost boundary until the final merge formingyy\. Thus, prior to the execution of the canonical rule ofyy, the tokenxxmust exist and share the same right endpoint\. ∎

###### Lemma E\.3\(Ancestor Characterization of Last Tokens\)\.

For any non\-empty stringssand any canonical tokenxx, we have

x∈P​\(θ​\(s\)\)⟺xappears as the last token at some stage during the tokenization of the strings\.x\\in P\(\\theta\(s\)\)\\quad\\Longleftrightarrow\\quad\\text\{$x$ appears as the last token at some stage during the tokenization of the string $s$\.\}

###### Proof\.

\(⇒\\Rightarrow\) Ifx∈P​\(θ​\(s\)\)x\\in P\(\\theta\(s\)\), thenθ​\(s\)\\theta\(s\)is obtained fromxxby a sequence of canonical merges\. Immediately before each such merge,xx\(or its descendant\) appears as the last token\.

\(⇐\\Leftarrow\) Ifxxappears as the last token at some stage, then the last tokenθ​\(s\)\\theta\(s\)must be obtained by further merges starting fromxx, implyingx∈P​\(θ​\(s\)\)x\\in P\(\\theta\(s\)\)\. ∎

###### Lemma E\.4\(Tokenization Consistency of Token\-wise Truncation\)\.

The Prefix Consistency \(Lemma[3\.1](https://arxiv.org/html/2605.30813#S3.Thmtheorem1)\) extends naturally to any contiguous subsequence of the token sequence during the tokenization\. It also applies to any prefix of merge rules in a proper dictionary\.

Specifically, for any integerk≤\|𝒟\|k\\leq\|\\mathcal\{D\}\|, after applying the firstkkmerge rules to a string and obtaining the token sequenceφ\\varphi, if we perform token\-wise truncation onφ\\varphi, the isolated contiguous subsequenceμ\\muis exactly the valid tokenization of its corresponding underlying stringπ​\(μ\)\\pi\(\\mu\)with the firstkkmerge rules\.

###### Proof\.

This follows directly from the inductive proof of Lemma[3\.1](https://arxiv.org/html/2605.30813#S3.Thmtheorem1), which applies to each individual merge step\.

From a more abstract perspective, because the boundaries delimitingμ\\musurvive the firstkkmerges, no merge operation has crossed them\. Due to the deterministic, left\-to\-right, non\-overlapping nature of BPE tokenization, the sequence of internal merges strictly withinμ\\muis entirely independent of the tokens outside these boundaries\. Therefore, isolatingμ\\muvia token\-wise truncation yields the exact same state as independently tokenizing the raw stringπ​\(μ\)\\pi\(\\mu\)with the firstkkrules\. ∎

###### Lemma E\.5\(Step\-wise Consistency of Token\-wise Truncation\)\.

Letssbe a string andφ\\varphibe its token sequence after applying the firstkkmerge rules\. Letμ\\mube a contiguous subsequence ofφ\\varphiobtained by token\-wise truncation\. Letrrbe the\(k\+1\)\(k\+1\)\-th merge rule\. Applyingrrtoμ\\mu\(i\.e\.,𝕋r​\(μ\)\\mathbb\{T\}\_\{r\}\(\\mu\)\) yields exactly the valid tokenization of the corresponding substringπ​\(μ\)\\pi\(\\mu\)with the firstk\+1k\+1merge rules\.

###### Proof\.

By Lemma[E\.4](https://arxiv.org/html/2605.30813#A5.Thmtheorem4), token\-wise truncation is valid afterkkmerges\. Thus,μ\\muis exactly the correct token sequence for the substringπ​\(μ\)\\pi\(\\mu\)after the firstkkrules\.

By the definition of the BPE tokenization, the tokenization afterk\+1k\+1rules is obtained by applying the\(k\+1\)\(k\+1\)\-th rulerrdirectly to the tokenization resulting from the firstkkrules\.

Therefore,𝕋r​\(μ\)\\mathbb\{T\}\_\{r\}\(\\mu\)yields the tokenization ofπ​\(μ\)\\pi\(\\mu\)after applying the firstk\+1k\+1merge rules\. ∎

###### Lemma E\.6\(Exclusive Priorities of Tokens satisfying the Prefix Last\-Token Condition\)\.

Letttbe a non\-atomic suffix token of a stringssthat satisfies the Prefix Last\-Token Condition \(Definition[4\.1](https://arxiv.org/html/2605.30813#S4.Thmtheorem1)\)\. During the tokenization process ofss, immediately prior to executing the canonical rule oftt, the rightmost two tokens in the sequence are exactlypre⁡\(t\)\\operatorname\{pre\}\(t\)andsuc⁡\(t\)\\operatorname\{suc\}\(t\)\.

###### Proof\.

For the stringss, let the token boundary immediately to the left of the suffix stringsuc⁡\(t\)\\operatorname\{suc\}\(t\)be◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}, the token boundary immediately to the left of the suffix stringttbe◆t\\lozenge\_\{t\}\. We term the token boundaries strictly between◆t\\lozenge\_\{t\}and◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}as theinternal boundariesofpre⁡\(t\)\\operatorname\{pre\}\(t\)\.

The following proof will focus on the status of◆t\\lozenge\_\{t\},◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}, and the internal boundaries ofpre⁡\(t\)\\operatorname\{pre\}\(t\)\.

#### Step 1: The internal boundaries ofpre⁡\(t\)\\operatorname\{pre\}\(t\)must be eliminated first\.

We first prove that during the merges before applying the canonical rule oftt, no rule can eliminate◆t\\lozenge\_\{t\}or◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}before all token boundaries between them are eliminated\.

Suppose a merge rulerrattempts to eliminate one or more of these boundaries while at least one internal boundary still exists:

- •If the rulerreliminates◆t\\lozenge\_\{t\}: By Lemma[E\.5](https://arxiv.org/html/2605.30813#A5.Thmtheorem5), we can perform token\-wise truncation at◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}immediately before the rulerr, leaving the tokenization of the prefix strings−suc⁡\(t\)s^\{\-\\operatorname\{suc\}\(t\)\}\. The tokenization before and after the rulerrmust be valid fors−suc⁡\(t\)s^\{\-\\operatorname\{suc\}\(t\)\}\. However, with the merge rulerr,◆t\\lozenge\_\{t\}is eliminated before the internal boundaries ofpre⁡\(t\)\\operatorname\{pre\}\(t\)are cleared,pre⁡\(t\)\\operatorname\{pre\}\(t\)can never become an isolated token as the last token during the tokenization\. This leads topre⁡\(t\)∉P​\(θ​\(s−suc⁡\(t\)\)\)\\operatorname\{pre\}\(t\)\\notin P\(\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)\), which contradicts theReachabilitycondition\.
- •If the rulerreliminates◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}while◆t\\lozenge\_\{t\}survives: We truncate at◆t\\lozenge\_\{t\}\. The suffix should form the valid tokenization oftt\. However, the boundary betweenpre⁡\(t\)\\operatorname\{pre\}\(t\)andsuc⁡\(t\)\\operatorname\{suc\}\(t\)is eliminated before the internal boundaries ofpre⁡\(t\)\\operatorname\{pre\}\(t\)are cleared\. This contradicts the proper merge sequence of the canonical tokentt\.

Furthermore, if the internal boundaries exist, there are at least 3 consecutive token boundaries\. Since BPE only merges adjacent pairs without overlapping, the rulerrcan eliminate at most half of the consecutive boundaries\. Thus, it is impossible for all of◆t\\lozenge\_\{t\},◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}, and the internal boundaries ofpre⁡\(t\)\\operatorname\{pre\}\(t\)to be eliminated simultaneously\.

This confirms that any attempt to eliminate◆t\\lozenge\_\{t\}or◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}must happen strictly after the internal boundaries are fully eliminated\.

Furthermore, since the truncation between◆t\\lozenge\_\{t\}and◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}yields the valid tokenization ofpre⁡\(t\)\\operatorname\{pre\}\(t\), and the canonical rule ofpre⁡\(t\)\\operatorname\{pre\}\(t\)has a strictly higher priority than the canonical rule oftt\(orpre⁡\(t\)\\operatorname\{pre\}\(t\)is an atomic token\), it follows that immediately before executing the canonical rule ofttduring the tokenization ofss, both◆t\\lozenge\_\{t\}and◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}must survive without any internal boundaries between them, which meanspre⁡\(t\)\\operatorname\{pre\}\(t\)between◆t\\lozenge\_\{t\}and◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}should become a single isolated token\.

#### Step 2:pre⁡\(t\)\\operatorname\{pre\}\(t\)cannot be consumed before the canonical rule ofttis executed\.

Next, consider the process afterpre⁡\(t\)\\operatorname\{pre\}\(t\)is completely formed but before the canonical rule ofttis executed\.

- •Ifpre⁡\(t\)\\operatorname\{pre\}\(t\)merges leftward \(acting as a successor\): The boundary◆t\\lozenge\_\{t\}is eliminated\. Since a single BPE merge eliminates at most one boundary around each token,◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}must survive\. Truncating at◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}, the prefix corresponds to the tokenization ofs−suc⁡\(t\)s^\{\-\\operatorname\{suc\}\(t\)\}\. This merge impliespre⁡\(t\)\\operatorname\{pre\}\(t\)is merged into a larger token during the tokenization ofs−suc⁡\(t\)s^\{\-\\operatorname\{suc\}\(t\)\}, soθ​\(s−suc⁡\(t\)\)≠pre⁡\(t\)\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)\\neq\\operatorname\{pre\}\(t\)\. By definition, the executed rule is exactly the unique child ofpre⁡\(t\)\\operatorname\{pre\}\(t\)onP​\(θ​\(s−suc⁡\(t\)\)\)P\(\\theta\(s^\{\-\\operatorname\{suc\}\(t\)\}\)\)\. However, thePriority dominancerequires the canonical rule ofttto have a strictly higher priority than this child\. Thus, it is impossible for this merge to occur before the canonical rule oftt, contradicting the assumption\.
- •Ifpre⁡\(t\)\\operatorname\{pre\}\(t\)merges rightward: The merge eliminates◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}while◆t\\lozenge\_\{t\}survives\. Truncating at◆t\\lozenge\_\{t\}, the suffix must represent the valid tokenization oftt\. However, the canonical rule ofttis not yet applied, while the prefix tokenpre⁡\(t\)\\operatorname\{pre\}\(t\)merges rightward with a token other thansuc⁡\(t\)\\operatorname\{suc\}\(t\)\. This contradicts the proper merge sequence of the canonical tokentt\.

#### Step 3: The right side must solely besuc⁡\(t\)\\operatorname\{suc\}\(t\)\.

Finally, immediately prior to executing the canonical rule oftt, suppose there are still other surviving token boundaries between◆suc⁡\(t\)\\lozenge\_\{\\operatorname\{suc\}\(t\)\}and the end of the sequence\. If we truncate at◆t\\lozenge\_\{t\}, the isolated suffix sequence would contain more than two tokens\. However, in the valid tokenization of the canonical tokentt, the token sequence exactly prior to applying its canonical rule must consist of exactly two tokens:pre⁡\(t\)\\operatorname\{pre\}\(t\)andsuc⁡\(t\)\\operatorname\{suc\}\(t\)\. Contradiction\.

#### Conclusion\.

Thus, immediately prior to executing the canonical rule oftt, the token sequence ends exactly withpre⁡\(t\)\\operatorname\{pre\}\(t\)andsuc⁡\(t\)\\operatorname\{suc\}\(t\)\. ∎

With the above notation and lemmas in place, we now prove Theorem[4\.2](https://arxiv.org/html/2605.30813#S4.Thmtheorem2)by establishing four structural claims: upward closure, uniqueness, validity ofθ​\(s\)\\theta\(s\), and maximality\.

#### Claim 1 \(Upward Closure\)\.

Letssbe a non\-empty string and letkkbe the longest canonical suffix token ofss\. For any nodew∈SufSucTree⁡\(k\)w\\in\\operatorname\{SufSucTree\}\(k\), ifwwsatisfies the Prefix Last\-Token Condition \(Definition[4\.1](https://arxiv.org/html/2605.30813#S4.Thmtheorem1)\) with respect toss, then its parentsuc⁡\(w\)\\operatorname\{suc\}\(w\)in the Successor Forest \(if it exists\) also satisfies the Prefix Last\-Token Condition\.

###### Proof\.

Ifwwis an atomic token, it is a root, making the claim vacuous\. Assumewwis non\-atomic\. Letv=suc⁡\(w\)v=\\operatorname\{suc\}\(w\)\. We aim to provevvsatisfies the condition\. Ifvvis an atomic token, it satisfies the condition by definition\. Thus, we further assumevvis non\-atomic\.

#### Notations\.

We define the following components and their corresponding canonical merge rules:

- •u=pre⁡\(w\)u=\\operatorname\{pre\}\(w\)andv=suc⁡\(w\)v=\\operatorname\{suc\}\(w\), withrwr\_\{w\}being the canonical rule\(u,v\)→w\(u,v\)\\to w\.
- •x=pre⁡\(v\)x=\\operatorname\{pre\}\(v\)andy=suc⁡\(v\)y=\\operatorname\{suc\}\(v\), withrvr\_\{v\}being the canonical rule\(x,y\)→v\(x,y\)\\to v\.

Sincewwis a suffix ofssandvvis a suffix ofww, the tokenyyis exactly the suffix token ofss\. Lets−ys^\{\-y\}denote the prefix string obtained by removingyyfromss\.

#### Step 1: Reachability\.

Becausewwsatisfies the condition, Lemma[E\.6](https://arxiv.org/html/2605.30813#A5.Thmtheorem6)guarantees that during the tokenization ofss, immediately prior to executingrwr\_\{w\}, the rightmost two tokens are exactlyuuandvv\. The existence of the fully formed tokenvvrequires that its canonical rulervr\_\{v\}must have been executed at some earlier stage\.

Backtracking to the state immediately prior to the execution ofrvr\_\{v\}, the token sequence must end exactly withxxandyy\. At this specific moment, the token boundary betweenxxandyyexists\. By Lemma[E\.5](https://arxiv.org/html/2605.30813#A5.Thmtheorem5), performing token\-wise truncation at this boundary \(removingyy\) yields the valid tokenization ofs−ys^\{\-y\}at this stage, ending with the tokenxx\. By Lemma[E\.3](https://arxiv.org/html/2605.30813#A5.Thmtheorem3), sincexxappears as the last token during the tokenization ofs−ys^\{\-y\}, we havex∈P​\(θ​\(s−y\)\)x\\in P\(\\theta\(s^\{\-y\}\)\)\. This establishes theReachabilitycondition forvv\.

#### Step 2: Priority Dominance\.

Supposex≠θ​\(s−y\)x\\neq\\theta\(s^\{\-y\}\)\. This implies the finalxxofs−ys^\{\-y\}will eventually be consumed leftward by some merge rulercxr\_\{c\_\{x\}\}to formcxc\_\{x\}\(the unique child ofxxonP​\(θ​\(s−y\)\)P\(\\theta\(s^\{\-y\}\)\)\)\.

Letφ\\varphibe the token sequence ofssimmediately prior to executingrvr\_\{v\}\. As established in Step 1,φ\\varphiends exactly withxxandyy, so we can writeφ=η⊕\[x,y\]\\varphi=\\eta\\oplus\[x,y\]for some prefix sequenceη\\eta\. By Lemma[E\.5](https://arxiv.org/html/2605.30813#A5.Thmtheorem5), performing token\-wise truncation yieldsμ=η⊕\[x\]\\mu=\\eta\\oplus\[x\], which is exactly the token sequence ofs−ys^\{\-y\}at this same stage\.

Since the final tokenxxremains unmerged inμ\\mu, the rulercxr\_\{c\_\{x\}\}has not yet been applied\. Thus,rcxr\_\{c\_\{x\}\}cannot have a higher priority thanrvr\_\{v\}\.

Suppose their priorities are equal, meaningrcxr\_\{c\_\{x\}\}is exactlyrvr\_\{v\}, i\.e\.,cx=vc\_\{x\}=v\. Forrvr\_\{v\}to consumexxleftward, given its definition\(x,y\)→v\(x,y\)\\to v, we must havex=yx=y\. This makes the rulerv=\(x,x\)→vr\_\{v\}=\(x,x\)\\to v\. Asrcxr\_\{c\_\{x\}\}must be applied and alter the last token of𝕋rcx​\(μ\)\\mathbb\{T\}\_\{r\_\{c\_\{x\}\}\}\(\\mu\)tovv, the token sequences can be written asμ=η′⊕\[x,x\]\\mu=\\eta^\{\\prime\}\\oplus\[x,x\]andφ=η′⊕\[x,x,x\]\\varphi=\\eta^\{\\prime\}\\oplus\[x,x,x\]\. Now consider the deterministic left\-to\-right matching of the rule\(x,x\)→v\(x,x\)\\to v\. Either𝕋rv​\(μ\)\\mathbb\{T\}\_\{r\_\{v\}\}\(\\mu\)or𝕋rv​\(φ\)\\mathbb\{T\}\_\{r\_\{v\}\}\(\\varphi\)exactly ends with the tokenvv, while the other ends with the tokenxx, contradicting the analysis where both must end withvv\.

Therefore, the canonical rulervr\_\{v\}must have a strictly higher priority thanrcxr\_\{c\_\{x\}\}, satisfying thePriority dominancecondition for tokenvvwith respect toss\.

#### Conclusion\.

With both conditions verified,suc⁡\(w\)=v\\operatorname\{suc\}\(w\)=vsatisfies the Prefix Last\-Token Condition with respect toss\. ∎

#### Claim 2 \(Uniqueness of the satisfying child\)\.

Letssbe a non\-empty string\. For any nodevv, there existsat most onechildwwofvvthat satisfies the Prefix Last\-Token Condition \(Definition[4\.1](https://arxiv.org/html/2605.30813#S4.Thmtheorem1)\) with respect toss\.

###### Proof\.

Assume for contradiction that there exist two distinct nodeswa≠wbw\_\{a\}\\neq w\_\{b\}that both satisfy the condition with respect toss, and share the same successorv=suc⁡\(wa\)=suc⁡\(wb\)v=\\operatorname\{suc\}\(w\_\{a\}\)=\\operatorname\{suc\}\(w\_\{b\}\)\. Letua=pre⁡\(wa\)u\_\{a\}=\\operatorname\{pre\}\(w\_\{a\}\)andub=pre⁡\(wb\)u\_\{b\}=\\operatorname\{pre\}\(w\_\{b\}\)\. Letrwa=\(ua,v\)r\_\{w\_\{a\}\}=\(u\_\{a\},v\)andrwb=\(ub,v\)r\_\{w\_\{b\}\}=\(u\_\{b\},v\)\. Sincewa≠wbw\_\{a\}\\neq w\_\{b\}, we haveua≠ubu\_\{a\}\\neq u\_\{b\}\.

By theReachabilitycondition, bothua,ub∈P​\(θ​\(s−v\)\)u\_\{a\},u\_\{b\}\\in P\(\\theta\(s^\{\-v\}\)\)\. SinceP​\(θ​\(s−v\)\)P\(\\theta\(s^\{\-v\}\)\)is a chain on a tree, one must be an ancestor of the other\. Without loss of generality, assumeuau\_\{a\}is a proper ancestor ofubu\_\{b\}\. Thusua≠θ​\(s−v\)u\_\{a\}\\neq\\theta\(s^\{\-v\}\)\.

This implies that during the tokenization ofs−vs^\{\-v\}, the tokenuau\_\{a\}is formed earlier thanubu\_\{b\}and is required to formubu\_\{b\}\. Forubu\_\{b\}to eventually form,uau\_\{a\}must uniquely merge leftward via some rulerca=\(k,ua\)→car\_\{c\_\{a\}\}=\(k,u\_\{a\}\)\\to c\_\{a\}, wherecac\_\{a\}is the unique child ofuau\_\{a\}onP​\(ub\)P\(u\_\{b\}\)\. Sinceua,ub∈P​\(θ​\(s−v\)\)u\_\{a\},u\_\{b\}\\in P\(\\theta\(s^\{\-v\}\)\),cac\_\{a\}is also the unique child ofuau\_\{a\}onP​\(θ​\(s−v\)\)P\(\\theta\(s^\{\-v\}\)\)\.

By thePriority dominancecondition forwaw\_\{a\}, we haverwa\>rcar\_\{w\_\{a\}\}\>r\_\{c\_\{a\}\}andwa≠caw\_\{a\}\\neq c\_\{a\}\.

Becausewaw\_\{a\}satisfies the condition, Lemma[E\.6](https://arxiv.org/html/2605.30813#A5.Thmtheorem6)guarantees that during the tokenization ofss, immediately prior to executing the rulerwar\_\{w\_\{a\}\}, the rightmost two tokens are exactlyuau\_\{a\}andvv\. Likewise, immediately prior to executing the rulerwbr\_\{w\_\{b\}\}, the rightmost two tokens are exactlyubu\_\{b\}andvv\.

Becauseuau\_\{a\}is a proper ancestor ofubu\_\{b\}, during the forward tokenization ofss, the state whereuau\_\{a\}andvvare the rightmost two tokens must strictly precede the state whereubu\_\{b\}andvvare the rightmost two tokens\.

During the tokenization ofss, when executing the rulerwar\_\{w\_\{a\}\}, the token boundary immediately to the left of the last tokenvvmust survive the merge; otherwise, the last token would never bevvimmediately prior to executing the rulerwbr\_\{w\_\{b\}\}later on\. This means that the execution ofrwar\_\{w\_\{a\}\}leaves a mergeable adjacent token pair\[ua,v\]\[u\_\{a\},v\]intact at the end\. However, by the left\-to\-right nature of BPE, this will only occur when the predecessor and the successor of the merge rule are the same, and there is a sequence of at least three such identical tokens being adjacent\.Thus, we must haveua=vu\_\{a\}=vandwaw\_\{a\}is a child ofuau\_\{a\}\.

Hence, immediately prior to executing the rulerwar\_\{w\_\{a\}\}during the tokenization ofss, the rightmost three tokens in the sequence are exactly the same: the tokenvv\. After executing the rulerwar\_\{w\_\{a\}\}, the rightmost two tokens become exactlywaw\_\{a\}andvv\. Consider the token\-wise truncation removing the last tokenvvto obtain the valid tokenization ofs−vs^\{\-v\}\. At this specific stage, we havewaw\_\{a\}as the last token of the tokenization ofs−vs^\{\-v\}, i\.e\.,wa∈P​\(θ​\(s−v\)\)w\_\{a\}\\in P\(\\theta\(s^\{\-v\}\)\)\.

Sincewaw\_\{a\}is a child ofuau\_\{a\},waw\_\{a\}is the unique child ofuau\_\{a\}onP​\(θ​\(s−v\)\)P\(\\theta\(s^\{\-v\}\)\), contradictingwa≠caw\_\{a\}\\neq c\_\{a\}\.

Therefore, the assumption is false\. ∎

#### Claim 3 \(The answer node satisfies the condition\)\.

Letw=θ​\(s\)w=\\theta\(s\)be the last token for a non\-empty stringssafter tokenization\. Thenwwsatisfies the Prefix Last\-Token Condition \(Definition[4\.1](https://arxiv.org/html/2605.30813#S4.Thmtheorem1)\) with respect toss\.

###### Proof\.

Ifwwis an atomic token, it satisfies the condition by definition\. Assumewwis non\-atomic\. Letu=pre⁡\(w\)u=\\operatorname\{pre\}\(w\)andv=suc⁡\(w\)v=\\operatorname\{suc\}\(w\), withrwr\_\{w\}being the canonical merge rule\(u,v\)→w\(u,v\)\\to w\.

Sincew=θ​\(s\)w=\\theta\(s\), the tokenization ofsseventually produceswwas its last token\. This implies that immediately prior to the execution ofrwr\_\{w\}, the token sequence ofssmust end exactly withuuandvv\.

#### Step 1: Reachability\.

At this exact moment, the boundary betweenuuandvvexists\. By Lemma[E\.5](https://arxiv.org/html/2605.30813#A5.Thmtheorem5), performing token\-wise truncation at this boundary \(removingvv\) yields the valid tokenization ofs−vs^\{\-v\}at this stage, which ends with the tokenuu\. By Lemma[E\.3](https://arxiv.org/html/2605.30813#A5.Thmtheorem3), sinceuuappears as the last token during the tokenization ofs−vs^\{\-v\}, we haveu∈P​\(θ​\(s−v\)\)u\\in P\(\\theta\(s^\{\-v\}\)\)\. This establishes theReachabilitycondition forww\.

#### Step 2: Priority Dominance\.

Supposeu≠θ​\(s−v\)u\\neq\\theta\(s^\{\-v\}\)\. Then the last tokenuuduring the tokenization ofs−vs^\{\-v\}must eventually be consumed leftward by some merge rulercu=\(k,u\)→cur\_\{c\_\{u\}\}=\(k,u\)\\to c\_\{u\}, wherecuc\_\{u\}is the unique child ofuuonP​\(θ​\(s−v\)\)P\(\\theta\(s^\{\-v\}\)\)\.

Sinceuuappears as the last token, during the tokenization ofs−vs^\{\-v\}immediately prior to the execution ofrwr\_\{w\}, the rulercur\_\{c\_\{u\}\}has not yet been applied, meaning its priority cannot be strictly higher thanrwr\_\{w\}, i\.e\.,rcu≤rwr\_\{c\_\{u\}\}\\leq r\_\{w\}\.

Suppose their priorities are equal, i\.e\.,cu=wc\_\{u\}=w\. Sincerw=\(u,v\)r\_\{w\}=\(u,v\)andrcu=\(k,u\)r\_\{c\_\{u\}\}=\(k,u\), we must haveu=vu=v, making the rulerw=\(u,u\)→wr\_\{w\}=\(u,u\)\\to w\.

Sinceu≠θ​\(s−v\)u\\neq\\theta\(s^\{\-v\}\)andcuc\_\{u\}is the unique child ofuuonP​\(θ​\(s−v\)\)P\(\\theta\(s^\{\-v\}\)\), during the tokenization ofs−vs^\{\-v\}when applyingrcu=rwr\_\{c\_\{u\}\}=r\_\{w\}, the last token must be altered towwin order to formθ​\(s−v\)\\theta\(s^\{\-v\}\)\. However, sincew=θ​\(s\)w=\\theta\(s\), during the tokenization ofsswhen applyingrwr\_\{w\}, the last token must be altered toww\. Similar to the proof of Step 2 in Claim 1, this will be impossible\. Thuscu≠wc\_\{u\}\\neq wandrcu<rwr\_\{c\_\{u\}\}<r\_\{w\}\.

This establishes thePriority Dominancecondition forww\. ∎

#### Claim 4 \(The answer node has no children satisfying the condition\)\.

Letssbe a non\-empty string\. The answer nodeθ​\(s\)\\theta\(s\)has no children that satisfy the Prefix Last\-Token Condition \(Definition[4\.1](https://arxiv.org/html/2605.30813#S4.Thmtheorem1)\) with respect toss\.

###### Proof\.

Letv=θ​\(s\)v=\\theta\(s\)\. Assume for contradiction that there exists a child nodewwofvv\(meaningv=suc⁡\(w\)v=\\operatorname\{suc\}\(w\)\) that satisfies the condition with respect toss\. Letu=pre⁡\(w\)u=\\operatorname\{pre\}\(w\), and letrwr\_\{w\}be the canonical merge rule\(u,v\)→w\(u,v\)\\to w\.

Becausev=θ​\(s\)v=\\theta\(s\), the Prefix Consistency \(Lemma[3\.1](https://arxiv.org/html/2605.30813#S3.Thmtheorem1)\) guarantees that the final tokenization ofssis exactly the tokenization ofs−vs^\{\-v\}appended withvv, i\.e\.,𝕋​\(s\)=𝕋​\(s−v\)⊕\[v\]\\mathbb\{T\}\(s\)=\\mathbb\{T\}\(s^\{\-v\}\)\\oplus\[v\]\.

By theReachabilitycondition forww, we haveu∈P​\(θ​\(s−v\)\)u\\in P\(\\theta\(s^\{\-v\}\)\)\. We consider the two possible cases foruu:

#### Case 1:u=θ​\(s−v\)u=\\theta\(s^\{\-v\}\)\.

Ifuuis exactly the last token ofs−vs^\{\-v\}, then𝕋​\(s−v\)\\mathbb\{T\}\(s^\{\-v\}\)ends withuu\. Consequently, the final tokenization𝕋​\(s\)\\mathbb\{T\}\(s\)must end exactly with the adjacent pair of tokens\[u,v\]\[u,v\]\. However, sincerw=\(u,v\)→wr\_\{w\}=\(u,v\)\\to wis a valid merge rule in the vocabulary, the deterministic BPE process would not terminate leaving the matchable pair\[u,v\]\[u,v\]intact\. Ifu≠vu\\neq v, the tokenization ofsswill not terminate with the isolated tokenvvas the last token\. Even ifu=vu=v, in the final tokenization𝕋​\(s\)\\mathbb\{T\}\(s\), the rightmost two tokens would at most bewwandvv, rather than two adjacentuu’s, meaning the tokenization𝕋​\(s−v\)\\mathbb\{T\}\(s^\{\-v\}\)would havewwas the last token\. Hence,uuwill not beθ​\(s−v\)\\theta\(s^\{\-v\}\)\.

#### Case 2:u≠θ​\(s−v\)u\\neq\\theta\(s^\{\-v\}\)\.

Ifuuis not the last token ofs−vs^\{\-v\}, it must eventually be consumed leftward by some merge rulercu=\(k,u\)→cur\_\{c\_\{u\}\}=\(k,u\)\\to c\_\{u\}, wherecuc\_\{u\}is the unique child ofuuonP​\(θ​\(s−v\)\)P\(\\theta\(s^\{\-v\}\)\)\. Becausewwsatisfies the condition, thePriority dominancestrictly dictates thatrw\>rcur\_\{w\}\>r\_\{c\_\{u\}\}\.

During the tokenization of the stringss, sincev=θ​\(s\)v=\\theta\(s\), the suffix tokenvvmust never be consumed by any rightward merges after its formation\. So the token boundary immediately to the left of the suffixvvis never eliminated\. By Lemma[E\.4](https://arxiv.org/html/2605.30813#A5.Thmtheorem4), we can truncate the tokenization ofssremoving the suffix corresponding to the stringvvand obtain the valid tokenization ofs−vs^\{\-v\}after the execution of each merge rule\.

Consider the tokenizations of boths−vs^\{\-v\}andssimmediately prior to executingrwr\_\{w\}, where the tokensuuandvvhave been formed, as they are the predecessor and the successor ofww\. At this stage,rcur\_\{c\_\{u\}\}has not been executed sincerw\>rcur\_\{w\}\>r\_\{c\_\{u\}\}\. Hence, at this moment, the last token of the tokenization ofs−vs^\{\-v\}isuu, and the last token of the tokenization ofssisvv\. So the rightmost two tokens of the tokenization ofssareuuandvv\.

However, sincev=θ​\(s\)v=\\theta\(s\), the execution ofrwr\_\{w\}must skip the rightmost adjacentuuandvv, which is only possible whenu=vu=v\. Similar to the proof of Step 2 in Claim 1, this will also be impossible, as it leads to the case wherew=cuw=c\_\{u\}, directly contradicting the requirement ofPriority dominance\(rw\>rcur\_\{w\}\>r\_\{c\_\{u\}\}\)\.

#### Conclusion\.

Both cases yield a direct contradiction\. Therefore, the assumption is false, and the answer nodeθ​\(s\)\\theta\(s\)has no children satisfying the condition\. ∎

#### Conclusion of the Proof\.

We now conclude the proof of Theorem[4\.2](https://arxiv.org/html/2605.30813#S4.Thmtheorem2)\. Claim 1 establishes that the Prefix Last\-Token Condition is upward closed along successor edges\. Claim 2 shows that for any node, at most one child can satisfy the condition\. Claim 3 guarantees that the true last tokenθ​\(s\)\\theta\(s\)does satisfy the condition, while Claim 4 shows that it is maximal and admits no further extension\.

Together, these four claims imply that the set of tokens inSufSucTree⁡\(t\)\\operatorname\{SufSucTree\}\(t\)satisfying the Prefix Last\-Token Condition forms exactly a single, maximal path fromθ​\(s\)\\theta\(s\)to the root of the tree\. This completes the proof\.

## Appendix FSquare\-Root Tiled Transition Table

In an Aho–Corasick automaton\(Aho and Corasick,[1975](https://arxiv.org/html/2605.30813#bib.bib4)\), a straightforward implementation stores a full transition array over the alphabetΣ\\Sigmafor every state, yielding constant\-time transitions at the cost of a large memory footprint\.

A key observation is that, for any state, its transition table differs from that of its suffix link only at entries corresponding to outgoing trie edges\. This structural locality allows transitions to be shared persistently across states\.

We exploit this property using a square\-root tiling of the alphabet\. The alphabet is partitioned into fixed\-size tiles; each state stores references to its tiles, and a tile is duplicated only when a transition within it is modified\. As a result, most tiles are shared between a state and its suffix link, while updates remain local to the affected tiles\.

This representation preserves𝒪​\(1\)\\mathcal\{O\}\(1\)transition queries and supports efficient construction, while substantially reducing memory usage in practice\.

## Appendix GEager Output Algorithm Details

Based on the Active Frontier definition in[Section6\.1](https://arxiv.org/html/2605.30813#S6.SS1), we maintain the subgraph of the Prefix Tree of Tokens composed of the nodes in𝒫\\mathcal\{P\}and their ancestors to identify the common ancestral path\. The algorithm proceeds as follows:

1. 1\.Window Maintenance: We maintain𝒫\\mathcal\{P\}using two pointers\. The right pointer adds the newθ​\(s\)\\theta\(s\)\. The left pointer advances to satisfy the bound\|s\|−d​\(s\)\|s\|\-d\(s\), removing expired tokens, where the functionddis defined in[Section6\.1](https://arxiv.org/html/2605.30813#S6.SS1)\.
2. 2\.Subgraph Tracking: We maintain the nodes in𝒫\\mathcal\{P\}and their ancestors as a dynamic subgraph\. For each node in this subgraph, we record the number of its children that are also part of the subgraph\.
3. 3\.Eager Emission: We maintain the children of the virtual root\. If the virtual root has exactlyone childccin the subgraph, it implies that all Parental Candidates \(and thus all possible futures\) are descendants of the nodecc\. The token represented byccis therefore stable\. We emit the corresponding token, update the virtual root tocc, and repeat the check\.

Each node in the Prefix Tree of Tokens is inserted into the maintained subgraph exactly once, at the moment when the right pointer includes the corresponding last token into𝒫\\mathcal\{P\}\. A node may persist in the subgraph even after it ceases to belong to𝒫\\mathcal\{P\}, but it will be removed at most once, when it no longer lies on the ancestral path of any active Parental Candidate\.

During the advancement of the left pointer, each step removes either one node from the subgraph or performs a constant amount of bookkeeping without removal\. Consequently, each node contributes𝒪​\(1\)\\mathcal\{O\}\(1\)work over the entire execution of the algorithm\.

Therefore, the eager output mechanism incurs an amortized𝒪​\(1\)\\mathcal\{O\}\(1\)time overhead per input update\.

## Appendix HDetailed Benchmarks

### H\.1Experiment Setup

All benchmarks were conducted on a bare\-metal server equipped with anIntel®Xeon®Platinum 8362 CPU @ 2\.80GHz\. To ensure reproducibility and minimize measurement noise, we applied the following configurations:

- •CPU Isolation: Experiments were pinned to a single NUMA node \(Node 1\) containing 32 physical cores and 128 GB of RAM\.
- •System Tuning: Hyper\-threading \(SMT\), Turbo Boost, and Address Space Layout Randomization \(ASLR\) were disabled\.
- •Concurrency: Tests were executed using 30 parallel workers\. Each worker ran as an isolated, single\-threaded process that independently loaded the tokenizer and dataset\.

To ensure the validity of our results, we verified the correctness of our implementations against baseline methods prior to performance benchmarking\. For all benchmarks \(tokenizersandtiktokenbindings\), we changed the global allocator tomimalloc\(Leijenet al\.,[2019](https://arxiv.org/html/2605.30813#bib.bib39); Purplecoin,[2026](https://arxiv.org/html/2605.30813#bib.bib38)\)and fixed seeds for hash functions to guarantee stable and reproducible latency measurements\.

### H\.2Datasets

We constructed three representative datasets using specific subsets of publicly available datasets\.[Table2](https://arxiv.org/html/2605.30813#A8.T2)provides a summary of the dataset statistics\.

Table 2:Datasets used in experiments\.The specific sampling strategies were as follows:

- •English & Chinese: We utilized the first Parquet shard from the standard training split of the Wikipedia dataset \(20231101 dump\), and employed a strided sampling strategy, extracting every 42nd document for English and every 60th document for Chinese\.
- •Code: We extracted data from the RedPajama GitHub subset\(Together Computer,[2023a](https://arxiv.org/html/2605.30813#bib.bib35)\), specifically using the file with the filename “filtered\_08cdfa755e6d4d89b673d5bd1acee5f6\.sampled\.jsonl”\. We selected the first 2,500 documents from this file\.

Additionally, for pathological input analysis, we constructed strings consisting of2k2^\{k\}repetitions of the character ‘a’\.

### H\.3Tokenizers

We evaluated a diverse set of tokenizers supported by two major libraries: Hugging Face’stokenizersand OpenAI’stiktoken\.

Table 3:Tokenizers used in experiments for Hugging Face’stokenizers\.†Improper dictionary\. See Appendix[A\.6](https://arxiv.org/html/2605.30813#A1.SS6)for non\-properizable cases\.\*Properized dictionary\. See Appendix[A](https://arxiv.org/html/2605.30813#A1)for properization\.

#### Hugging Face Models\.

[Table3](https://arxiv.org/html/2605.30813#A8.T3)summarizes the open\-source tokenizers selected for our experiments\. These models exhibit varying pre\-tokenization strategies:

- •Regex\-Based: The majority of the models \(DeepSeek\-3\.2, GPT\-OSS, Llama\-3\.1, Llama\-4, Mistral\-3, and Qwen\-3\) employ regex\-based rules to split input text before tokenization\.
- •Rule\-Based: Ouro utilizes a digit\-based pre\-tokenization, and Gemma\-3 employs a rule on space characters\.
- •None: Notably, CodeLlama appliesno pre\-tokenizationlogic\.

Regarding the “properness” of dictionaries, the improper merge rules of Llama\-3\.1 are properized for our benchmarks \(see Appendix[A](https://arxiv.org/html/2605.30813#A1)\)\. In contrast, Gemma\-3’s merge rules are improper and cannot be properized \(see Appendix[A\.6](https://arxiv.org/html/2605.30813#A1.SS6)\)\.

Table 4:Tokenizers used in experiments for OpenAI’stiktoken\.
#### OpenAI Models\.

For thetiktokenlibrary, we utilized the standard built\-in encoding models\.[Table4](https://arxiv.org/html/2605.30813#A8.T4)lists the specific models that we used\. Note that all testedtiktokenmodels employ regex\-based pre\-tokenization\.

### H\.4Detailed Results

We report throughput in terms of MiB/s based on the median of multiple execution runs\. For each benchmark setting, throughput is computed as the total number of input bytes divided by the sum of the per\-document median execution times\. This aggregation reflects the effective end\-to\-end processing rate under repeated, independent tokenization workloads\.

We evaluate two input regimes, indicated by theEvalcolumn inLABEL:tab:detailed\-benchmarks\. In theconcatsetting, all documents in a dataset are concatenated into a single input stream, separated by newline characters, and tokenized as one continuous sequence\. In theper\-docsetting, each document is provided to the tokenizer as an independent input, and execution times are measured separately per document before aggregation\. The results reported in the main text \([Table1](https://arxiv.org/html/2605.30813#S7.T1)\) correspond to theconcatsetting, whileLABEL:tab:detailed\-benchmarksprovides the full breakdown across both evaluation regimes\.

Table 5:Detailed throughput benchmarks\.Base: Original built\-in BPE implementation\.Inc: Our incremental BPE\.Eager: “Inc” with eager output\.I/B: Relative speedup ofInc/Base\\text\{Inc\}/\\text\{Base\}\.DatasetEvalTokenizerCacheBaseIncEagerI/B\(MiB/s\)\(MiB/s\)\(MiB/s\)\(×\\times\)Hugging Face’stokenizersEnglishconcatCodeLlama✔0\.800\.802\.492\.492\.232\.233\.13EnglishconcatCodeLlama✘0\.800\.802\.452\.452\.192\.193\.05EnglishconcatDeepSeek\-3\.2✔1\.521\.521\.541\.541\.511\.511\.01EnglishconcatDeepSeek\-3\.2✘1\.421\.421\.481\.481\.441\.441\.04EnglishconcatGPT\-OSS✔1\.891\.891\.901\.901\.881\.881\.001\.00EnglishconcatGPT\-OSS✘1\.891\.891\.921\.921\.891\.891\.01EnglishconcatLlama\-3\.1✔1\.841\.841\.821\.821\.791\.790\.990\.99EnglishconcatLlama\-3\.1✘1\.871\.871\.871\.871\.851\.851\.001\.00EnglishconcatLlama\-4✔1\.841\.841\.861\.861\.841\.841\.01EnglishconcatLlama\-4✘1\.861\.861\.881\.881\.851\.851\.01EnglishconcatMistral\-3✔1\.831\.831\.831\.831\.801\.801\.001\.00EnglishconcatMistral\-3✘1\.841\.841\.831\.831\.801\.800\.990\.99EnglishconcatOuro✔1\.591\.591\.591\.591\.561\.561\.001\.00EnglishconcatOuro✘1\.511\.511\.571\.571\.501\.501\.04EnglishconcatQwen\-3✔1\.521\.521\.591\.591\.541\.541\.05EnglishconcatQwen\-3✘1\.341\.341\.501\.501\.451\.451\.12OpenAI’stiktokenEnglishconcatCL100K—11\.1711\.1710\.7810\.7810\.0310\.030\.960\.96EnglishconcatO200K—6\.216\.216\.126\.125\.985\.980\.990\.99EnglishconcatP50K—12\.8612\.8612\.4212\.4211\.4511\.450\.970\.97EnglishconcatR50K—12\.8312\.8312\.3812\.3811\.4911\.490\.960\.96Hugging Face’stokenizersEnglishper\-docCodeLlama✔2\.592\.593\.873\.873\.203\.201\.49Englishper\-docCodeLlama✘2\.592\.593\.893\.893\.203\.201\.50Englishper\-docDeepSeek\-3\.2✔2\.042\.042\.152\.152\.092\.091\.05Englishper\-docDeepSeek\-3\.2✘1\.831\.832\.042\.041\.911\.911\.11Englishper\-docGPT\-OSS✔2\.782\.782\.802\.802\.732\.731\.01Englishper\-docGPT\-OSS✘2\.772\.772\.812\.812\.752\.751\.01Englishper\-docLlama\-3\.1✔2\.742\.742\.782\.782\.712\.711\.01Englishper\-docLlama\-3\.1✘2\.782\.782\.812\.812\.722\.721\.01Englishper\-docLlama\-4✔2\.672\.672\.742\.742\.662\.661\.03Englishper\-docLlama\-4✘2\.662\.662\.772\.772\.742\.741\.04Englishper\-docMistral\-3✔2\.672\.672\.732\.732\.652\.651\.02Englishper\-docMistral\-3✘2\.712\.712\.792\.792\.712\.711\.03Englishper\-docOuro✔2\.262\.262\.292\.292\.212\.211\.01Englishper\-docOuro✘2\.052\.052\.182\.182\.062\.061\.07Englishper\-docQwen\-3✔1\.991\.992\.202\.202\.122\.121\.10Englishper\-docQwen\-3✘1\.671\.672\.082\.081\.951\.951\.25OpenAI’stiktokenEnglishper\-docCL100K—11\.5411\.5411\.1911\.1910\.3510\.350\.970\.97Englishper\-docO200K—6\.306\.306\.236\.235\.995\.990\.990\.99Englishper\-docP50K—13\.4313\.4312\.9812\.9811\.9811\.980\.970\.97Englishper\-docR50K—13\.4313\.4312\.9912\.9911\.9011\.900\.970\.97Hugging Face’stokenizersChineseconcatCodeLlama✔2\.332\.332\.552\.552\.472\.471\.10ChineseconcatCodeLlama✘2\.332\.332\.542\.542\.462\.461\.09ChineseconcatDeepSeek\-3\.2✔1\.521\.521\.411\.411\.321\.320\.930\.93ChineseconcatDeepSeek\-3\.2✘1\.551\.551\.411\.411\.331\.330\.910\.91ChineseconcatGPT\-OSS✔1\.711\.711\.841\.841\.731\.731\.08ChineseconcatGPT\-OSS✘1\.731\.731\.851\.851\.741\.741\.07ChineseconcatLlama\-3\.1✔1\.791\.791\.861\.861\.751\.751\.03ChineseconcatLlama\-3\.1✘1\.841\.841\.901\.901\.791\.791\.03ChineseconcatLlama\-4✔1\.731\.731\.751\.751\.631\.631\.01ChineseconcatLlama\-4✘1\.751\.751\.791\.791\.671\.671\.02ChineseconcatMistral\-3✔1\.731\.731\.801\.801\.711\.711\.04ChineseconcatMistral\-3✘1\.751\.751\.821\.821\.721\.721\.04ChineseconcatOuro✔1\.481\.481\.511\.511\.481\.481\.02ChineseconcatOuro✘1\.511\.511\.531\.531\.491\.491\.01ChineseconcatQwen\-3✔1\.591\.591\.651\.651\.541\.541\.04ChineseconcatQwen\-3✘1\.581\.581\.651\.651\.551\.551\.04OpenAI’stiktokenChineseconcatCL100K—9\.369\.3614\.9214\.9211\.7111\.711\.59ChineseconcatO200K—7\.107\.1010\.3910\.398\.638\.631\.46ChineseconcatP50K—10\.0310\.0313\.5713\.5710\.9810\.981\.35ChineseconcatR50K—10\.0110\.0113\.5513\.5510\.9510\.951\.35Hugging Face’stokenizersChineseper\-docCodeLlama✔4\.304\.304\.694\.694\.374\.371\.09Chineseper\-docCodeLlama✘4\.334\.334\.714\.714\.454\.451\.09Chineseper\-docDeepSeek\-3\.2✔1\.941\.941\.761\.761\.601\.600\.910\.91Chineseper\-docDeepSeek\-3\.2✘1\.961\.961\.771\.771\.621\.620\.910\.91Chineseper\-docGPT\-OSS✔2\.252\.252\.412\.412\.222\.221\.07Chineseper\-docGPT\-OSS✘2\.282\.282\.462\.462\.252\.251\.08Chineseper\-docLlama\-3\.1✔2\.472\.472\.542\.542\.312\.311\.03Chineseper\-docLlama\-3\.1✘2\.512\.512\.582\.582\.342\.341\.03Chineseper\-docLlama\-4✔2\.202\.202\.282\.282\.062\.061\.04Chineseper\-docLlama\-4✘2\.262\.262\.302\.302\.092\.091\.02Chineseper\-docMistral\-3✔2\.312\.312\.442\.442\.222\.221\.06Chineseper\-docMistral\-3✘2\.342\.342\.482\.482\.272\.271\.06Chineseper\-docOuro✔2\.192\.192\.252\.252\.172\.171\.03Chineseper\-docOuro✘2\.232\.232\.302\.302\.202\.201\.03Chineseper\-docQwen\-3✔2\.002\.002\.152\.151\.931\.931\.08Chineseper\-docQwen\-3✘1\.971\.972\.152\.151\.941\.941\.09OpenAI’stiktokenChineseper\-docCL100K—9\.759\.7516\.0016\.0012\.3712\.371\.64Chineseper\-docO200K—7\.377\.3711\.0911\.099\.099\.091\.50Chineseper\-docP50K—10\.5610\.5614\.5014\.5011\.5211\.521\.37Chineseper\-docR50K—10\.5510\.5514\.4514\.4511\.4611\.461\.37Hugging Face’stokenizersCodeconcatCodeLlama✔0\.820\.822\.372\.372\.212\.212\.88CodeconcatCodeLlama✘0\.810\.812\.342\.342\.172\.172\.87CodeconcatDeepSeek\-3\.2✔1\.521\.521\.561\.561\.541\.541\.03CodeconcatDeepSeek\-3\.2✘1\.451\.451\.541\.541\.481\.481\.07CodeconcatGPT\-OSS✔1\.861\.861\.881\.881\.871\.871\.01CodeconcatGPT\-OSS✘1\.881\.881\.881\.881\.871\.871\.001\.00CodeconcatLlama\-3\.1✔1\.781\.781\.801\.801\.781\.781\.02CodeconcatLlama\-3\.1✘1\.821\.821\.861\.861\.821\.821\.02CodeconcatLlama\-4✔1\.841\.841\.851\.851\.831\.831\.001\.00CodeconcatLlama\-4✘1\.861\.861\.851\.851\.831\.830\.990\.99CodeconcatMistral\-3✔1\.801\.801\.791\.791\.781\.780\.990\.99CodeconcatMistral\-3✘1\.791\.791\.801\.801\.791\.791\.01CodeconcatOuro✔1\.531\.531\.571\.571\.541\.541\.02CodeconcatOuro✘1\.481\.481\.551\.551\.501\.501\.05CodeconcatQwen\-3✔1\.541\.541\.651\.651\.621\.621\.08CodeconcatQwen\-3✘1\.381\.381\.591\.591\.541\.541\.15OpenAI’stiktokenCodeconcatCL100K—8\.168\.168\.508\.507\.967\.961\.04CodeconcatO200K—5\.495\.495\.515\.515\.415\.411\.001\.00CodeconcatP50K—8\.298\.298\.908\.908\.058\.051\.07CodeconcatR50K—8\.438\.438\.858\.858\.018\.011\.05Hugging Face’stokenizersCodeper\-docCodeLlama✔2\.112\.113\.623\.623\.173\.171\.72Codeper\-docCodeLlama✘2\.112\.113\.623\.623\.163\.161\.72Codeper\-docDeepSeek\-3\.2✔1\.941\.942\.042\.041\.981\.981\.06Codeper\-docDeepSeek\-3\.2✘1\.811\.812\.002\.001\.911\.911\.11Codeper\-docGPT\-OSS✔2\.372\.372\.412\.412\.382\.381\.01Codeper\-docGPT\-OSS✘2\.452\.452\.512\.512\.472\.471\.02Codeper\-docLlama\-3\.1✔2\.362\.362\.462\.462\.382\.381\.04Codeper\-docLlama\-3\.1✘2\.432\.432\.512\.512\.432\.431\.03Codeper\-docLlama\-4✔2\.362\.362\.372\.372\.372\.371\.01Codeper\-docLlama\-4✘2\.302\.302\.372\.372\.332\.331\.03Codeper\-docMistral\-3✔2\.302\.302\.362\.362\.322\.321\.03Codeper\-docMistral\-3✘2\.342\.342\.412\.412\.362\.361\.03Codeper\-docOuro✔2\.012\.012\.082\.082\.022\.021\.03Codeper\-docOuro✘1\.871\.872\.012\.011\.931\.931\.07Codeper\-docQwen\-3✔1\.841\.842\.082\.082\.022\.021\.13Codeper\-docQwen\-3✘1\.641\.642\.032\.031\.921\.921\.24OpenAI’stiktokenCodeper\-docCL100K—8\.358\.358\.768\.768\.198\.191\.05Codeper\-docO200K—5\.585\.585\.645\.645\.435\.431\.01Codeper\-docP50K—8\.548\.549\.259\.258\.318\.311\.08Codeper\-docR50K—8\.668\.669\.189\.188\.218\.211\.06Table 5:Detailed throughput benchmarks for Hugging Face’stokenizers\.

## Appendix IProfiling Analysis of Tokenization Pipelines

To better understand the system\-level bottlenecks in existing tokenization pipelines, we profile the execution using Linuxperfand visualize the results withinferno\(Gjengset,[2026](https://arxiv.org/html/2605.30813#bib.bib37)\)\. All flame graphs are collected from end\-to\-end tokenization runs, including normalization, pre\-tokenization, and BPE, under the same configurations as the benchmark experiments in the main text\.

For clarity of presentation, we apply light post\-processing to the raw call stacks\. Specifically, we retain only frames corresponding to the tokenization pipeline, truncate excessively deep stacks, and strip type and template information, preserving only function\-level symbols and the structural hierarchy of call stacks\. These transformations preserve the relative time distribution across pipeline stages while improving readability\.

![Refer to caption](https://arxiv.org/html/2605.30813v1/x5.png)Figure 5:Flame graph oftokenizers\(Qwen\-3\) execution with our incremental non\-eager implementation on the Code dataset\.The BPE merge phase \(tokenize\_without\_cache\) accounts for only 13\.11% of the total execution time\. The remaining time is dominated by normalization, pre\-tokenization, and result construction cost \(e\.g\., cloning token strings\)\.![Refer to caption](https://arxiv.org/html/2605.30813v1/x6.png)Figure 6:Flame graph oftiktoken\(O200K\) execution with our incremental non\-eager implementation on the Code dataset\.The regex matching phase \(dominated byfind\_from\_pos\_with\_option\_flagsandrun\) consumes approximately 80\.25% of the total CPU time\. In contrast, the actual BPE merge phase \(byte\_pair\_encodeon the left\) accounts for only 6\.45%\. This indicates that, under this configuration, the pre\-tokenization guardrail imposes a computational overhead exceeding 12×\\timesthe cost of the core BPE merge stage\.k=6k=6

![Refer to caption](https://arxiv.org/html/2605.30813v1/x7.png)![Refer to caption](https://arxiv.org/html/2605.30813v1/x8.png)k=7k=7

k=8k=8

![Refer to caption](https://arxiv.org/html/2605.30813v1/x9.png)![Refer to caption](https://arxiv.org/html/2605.30813v1/x10.png)k=9k=9

k=10k=10

![Refer to caption](https://arxiv.org/html/2605.30813v1/x11.png)![Refer to caption](https://arxiv.org/html/2605.30813v1/x12.png)k=11k=11

Figure 7:Flame graphs oftiktokenusing the original baseline implementation under pathological inputs with lengths from262^\{6\}to2112^\{11\}\.The BPE merge phase \(byte\_pair\_encodeon the left\) becomes increasingly dominant as the input length grows, consistent with its𝒪​\(n2\)\\mathcal\{O\}\(n^\{2\}\)time complexity, while the relative contribution of regex matching diminishes\.[Figure5](https://arxiv.org/html/2605.30813#A9.F5)presents a representative flame graph for Hugging Face’stokenizerslibrary with the tokenizer Qwen\-3, using our incremental non\-eager BPE implementation on the Code dataset\. The BPE merge phase itself \(tokenize\_without\_cache\) constitutes a minor fraction of the total runtime, highlighting that, in this pipeline, the overall throughput is primarily constrained by non\-BPE components\.

[Figure6](https://arxiv.org/html/2605.30813#A9.F6)profilestiktoken\(O200K\) when replacing its BPE stage with our incremental non\-eager implementation on the Code dataset\. It highlights that the bottleneck shifts to regex\-based pre\-tokenization, even when the BPE merge phase itself is asymptotically efficient\.

[Figure7](https://arxiv.org/html/2605.30813#A9.F7)profiles the originaltiktokenbaseline under pathological inputs of increasing length\. As the input size grows, the BPE merge phase increasingly dominates the execution time, providing a fine\-grained explanation for the throughput trends reported in[Figure3](https://arxiv.org/html/2605.30813#S6.F3)\. These pathological inputs are not intended to model realistic workloads, but to isolate and stress\-test the asymptotic behavior of the BPE merge stage\.

## Appendix JComparison with Other Incremental BPE Implementations

In this section, we provide a detailed theoretical and empirical comparison between our algorithm and other existing incremental BPE implementations in Rust, specifically those found in the cratebpeof the GitHub repositoryrust\-gems\(GitHub,[2026](https://arxiv.org/html/2605.30813#bib.bib7)\)\.

For the discussion about time and space complexity, letΣ\\Sigmabe the size of the alphabet,nnbe the input length,mmthe vocabulary size,tit\_\{i\}the length of theii\-th token, andttthe maximum token length\.

Withinrust\-gems, there are three main implementations of BPE as the methods of the structureBytePairEncoding\. Theencode\_via\_bitfieldmethod utilizes a binary heap and does not support incremental processing, placing it outside the scope of discussion\. The remaining two methods,encode\_via\_tableandencode\_via\_backtracking, support incremental tokenization and are analyzed in the following\.

All empirical tests and evaluations were performed by directly replacing the core BPE implementation in thetiktoken\(OpenAI,[2026b](https://arxiv.org/html/2605.30813#bib.bib6)\)library\. The tests were conducted on a Virtual Private Server \(VPS\), utilizing 8 cores of an Intel®Xeon®Gold 6448Y CPU situated on a single NUMA node, alongside a 16 GB slice of RAM\.

### J\.1Theoretical Time Complexity

Bothencode\_via\_tableandencode\_via\_backtrackingrely on a core underlying function:is\_valid\_token\_pair\. This function checks whether two tokens can be placed adjacently, ensuring that applying BPE to their concatenated string yields exactly the same two tokens\.

Assuming the “last token” of a string is determined, if a new token can be validly placed adjacent to it, the “last token” of the newly formed string should exactly be this new token\.

The validation process takes two tokens, iteratively un\-merges either the first token to its successor or the second token to its predecessor according to the rule with the lower priority, and verifies if the token boundary survives\. This verification takes𝒪​\(t\)\\mathcal\{O\}\(t\)time\. Although it can be pre\-computed or cached to achieve𝒪​\(1\)\\mathcal\{O\}\(1\)query time, the table will be𝒪​\(m2\)\\mathcal\{O\}\(m^\{2\}\), which may not be preferable for production\.

#### Method 1:encode\_via\_table

This method processes the tokenization with an Aho–Corasick automaton\. At each step, it iterates through the matching suffix tokens from longest to shortest for the current prefix, validating each suffix token viais\_valid\_token\_pair\.

- •Complexity: In the “best” case scenario, where the longest match is always the correct last token, it still requires at least one validation per step, yielding a lower bound ofΩ​\(n\)\\Omega\(n\)checks\. In the worst case, all suffix tokens must be checked, resulting in𝒪​\(n​t2\)\\mathcal\{O\}\(nt^\{2\}\)time complexity\.
- •Eager Output: Because this method maintains the prefix tree of tokens essentially, it can theoretically be adapted to support our proposed eager output mechanism\.

#### Method 2:encode\_via\_backtracking

This method employs a greedy strategy\. It attempts to find the longest token among the prefixes of the un\-tokenized string suffix, verifying it against the currently maintained token sequence viais\_valid\_token\_pair\. During the tokenization, it maintains a bit\-field to record the positions that cause backtracking to avoid re\-computation\.

It is notable that the algorithm will backtrack if the current token sequence corresponds to a path from a leaf to the root in the prefix tree of tokens, which indicates no subsequent prefix token will form a valid sequence\.

- •Complexity:In the “best” case, where the greedy choice is always correct, the validation cost is amortized over the token length, yieldingΩ​\(n\)\\Omega\(n\)\. However, in the worst case, whenever the algorithm backtracks, it incurs the cost of validating all possible prefix tokens with the last token, yielding𝒪​\(n​t2\)\\mathcal\{O\}\(nt^\{2\}\)\.
- •Eager Output:As the algorithm is greedy, there is no trivial way to apply our eager output mechanism to this implementation\.

[Table6](https://arxiv.org/html/2605.30813#A10.T6)summarizes the complexity of these methods alongside our proposed algorithm\. Our method incorporates a simple optimization where we first verify the longest suffix token before traversing the Suffix\-Successor Tree, ensuring our “best” case also achieves anΩ​\(n\)\\Omega\(n\)lower bound\.

Table 6:Overall complexity and feature comparison\.†AssumeΩ​\(1\)\\Omega\(1\)time for each check\.

### J\.2Constructed Pathological Case

To approach the worst\-case complexity, we designed an adversarial corner case\. This construction deliberately traps algorithms relying on greedy or longest\-prefix matching into massive validation and backtracking cycles\.

First, to achieve a merge depth ofK=4096K=4096, we expand the base alphabet from single bytes to 2\-byte pairs\. To strictly prevent accidental cross\-boundary merges, we construct base tokensB1,B2,…,BKB\_\{1\},B\_\{2\},\\ldots,B\_\{K\}fromℬ\\mathcal\{B\}:

ℬ=\{\(i,j\)∣i,j∈\{0,1,…,255\},i<j\}​\.\\mathcal\{B\}=\\\{\(i,j\)\\mid i,j\\in\\\{0,1,\\dots,255\\\},i<j\\\}\\text\{\.\}The structural constraint ofℬ\\mathcal\{B\}guarantees that placing any two base tokens adjacent to each other will not trigger unintended cross\-boundary merges\.

Next, using theseKKdistinct base tokens, we construct a sequence𝒮\\mathcal\{S\}of4​K4Kbytes:

𝒮=B1​⋯​BK​BK​⋯​B1​\.\\mathcal\{S\}=B\_\{1\}\\cdots B\_\{K\}B\_\{K\}\\cdots B\_\{1\}\\text\{\.\}
We then strategically assign merge rule priorities to establish the trap: the absolute highest priority is assigned to the rule merging the two identical tokens located exactly at the sequence center, i\.e\.,\(BK,BK\)\(B\_\{K\},B\_\{K\}\)\. Subsequently, we inject deep nested merge rules originating from the center towards the two ends, i\.e\.:

\(BK−1,BK\),\(BK−2,BK−1​BK\),…and\(BK,BK−1\),\(BK​BK−1,BK−2\),…\(B\_\{K\-1\},B\_\{K\}\),\(B\_\{K\-2\},B\_\{K\-1\}B\_\{K\}\),\\ldots\\quad\\text\{and\}\\quad\(B\_\{K\},B\_\{K\-1\}\),\(B\_\{K\}B\_\{K\-1\},B\_\{K\-2\}\),\\ldotsestablishing a merge depth of𝒪​\(K\)\\mathcal\{O\}\(K\)\.

The input string is constructed as a concatenation of 128 repetitions of𝒮\\mathcal\{S\}\. Under this pathological input, our method, with worst\-case time complexity guarantees, required37 msto tokenize the sequence\. In stark contrast,encode\_via\_tableandencode\_via\_backtrackingrequired15\.3 sand23\.9 s, respectively\.

### J\.3Space Complexity

#### Initialization

Our implementation utilizes a Square\-Root Tiled Transition Table for𝒪​\(1\)\\mathcal\{O\}\(1\)transitions \(detailed in Appendix[F](https://arxiv.org/html/2605.30813#A6)\), requiringΘ​\(Σ\)\\Theta\(\\sqrt\{\\Sigma\}\)space per state\. In contrast,rust\-gemsuses a double\-array implementation of Aho\-Corasick automata\(Kandaet al\.,[2023](https://arxiv.org/html/2605.30813#bib.bib40)\), which performs transitions in𝒪​\(1\)\\mathcal\{O\}\(1\)time and normally requires𝒪​\(1\)\\mathcal\{O\}\(1\)space per state, but potentially up to𝒪​\(Σ\)\\mathcal\{O\}\(\\Sigma\)in the worst case\.

Additionally, our method requires pre\-computing a Centroid Search Tree based on Centroid Decomposition for each Suffix\-Successor Tree, leading to an initialization space complexity of𝒪​\(∑ti\)=𝒪​\(m​t\)\\mathcal\{O\}\(\\sum t\_\{i\}\)=\\mathcal\{O\}\(mt\)\.rust\-gemsrequires no more than𝒪​\(m\)\\mathcal\{O\}\(m\)auxiliary space\.

#### Tokenization and Eager Output

During standard non\-eager tokenization, both our method and therust\-gemsimplementations require𝒪​\(n\)\\mathcal\{O\}\(n\)space to record the token sequence\.

However, space complexity behavior differs when applying our eager output mechanism\. As detailed in[Section6](https://arxiv.org/html/2605.30813#S6), the algorithm maintains the state using a two\-pointer approach\. Token nodes may persist in the state even after they cease to belong to𝒫\\mathcal\{P\}, because they lie in the window\. Therefore, the space complexity relates to the maximum number of nodes simultaneously maintained by the two\-pointer algorithm\. While a theoretically safe upper bound is𝒪​\(min⁡\(n,m​t\)\)\\mathcal\{O\}\(\\min\(n,mt\)\), empirical evaluations show that the practical memory footprint is much smaller\.

[Table7](https://arxiv.org/html/2605.30813#A10.T7)demonstrates the maximum number of nodes in the window maintained by the two\-pointer algorithm with the datasets described in[Table2](https://arxiv.org/html/2605.30813#A8.T2)\.

Table 7:Maximum size in the window during the two\-pointer algorithm of eager output\.

### J\.4End\-to\-End Benchmarks

We evaluated the end\-to\-end tokenization speeds forrust\-gemsand ours\. The tokenizer model P50K failed to run withrust\-gems; thus, it was excluded from the benchmarks\.

The median CPU times \(in seconds\) across standard datasets are presented in[Table8](https://arxiv.org/html/2605.30813#A10.T8)\. The results are also demonstrated as a violin plot in[Figure8](https://arxiv.org/html/2605.30813#A10.F8)\. The baseline implementation is the original BPE implementation intiktoken\. Our incremental approach consistently matches or outperforms the backtracking and table\-based implementations\.

Table 8:End\-to\-end CPU time \(seconds, median\) compared with other incremental BPE implementations\.Lower is better\.![Refer to caption](https://arxiv.org/html/2605.30813v1/x13.png)

Figure 8:Distributions of end\-to\-end execution times for different BPE implementations\.The violin plots illustrate the latency \(in seconds, lower is better\) across three tokenizers \(R50K, CL100K, O200K\) and three datasets \(English, Chinese, Code\)\. All methods were evaluated by replacing the core BPE in thetiktokenlibrary\. Specifically, baseline refers to the original BPE implementation oftiktoken, whilebacktrackingandtablecorrespond to theencode\_via\_backtrackingandencode\_via\_tablemethods fromrust\-gems, respectively\. Colored dashed vertical lines indicate the mean execution times for the baseline,ours \(inc\), andbacktrackingimplementations\.

Similar Articles

Compute Optimal Tokenization (2 minute read)

TLDR AI

This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.

Fast Byte Latent Transformer

Hugging Face Daily Papers

This paper introduces BLT Diffusion and speculative decoding techniques for byte-level language models to significantly reduce generation latency and memory bandwidth costs while maintaining quality.

Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Hugging Face Daily Papers

This paper proposes Byte-Level Distillation (BLD), a simple method for cross-tokenizer knowledge transfer in language models by operating at a shared byte-level interface, achieving competitive or superior performance compared to more complex existing approaches across 1B-8B parameter models.