A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Summary
This paper introduces Triadic Suffix Tokenization (TST), a deterministic tokenization scheme that partitions digits into three-digit triads with explicit magnitude markers to improve numerical reasoning in large language models. The method addresses inconsistent number fragmentation in standard tokenizers by providing transparent order-of-magnitude relationships at the token level, with two implementation variants offering scalable vocabulary expansion.
# A Triadic Suffix Tokenization Scheme for Numerical Reasoning

Source: https://arxiv.org/html/2604.11582

Olga Chetverina, Lecturer at MIPT, Moscow, Russia. Email: [email protected]. ORCID: 0000-0003-4924-8130. Previously archived at Zenodo: DOI 10.5281/zenodo.18999577. This research was conducted independently outside of any institutional assignments, and no institutional resources were used.

## Abstract

Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure—a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude (10^−15 to 10^18); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. While we focus on 3-digit groups (Triadic), the framework is inherently scalable to any group size for precise vocabulary optimization. Furthermore, it allows for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.

## 1 Introduction

Large Language Models (LLMs) have achieved remarkable performance on complex reasoning tasks, yet they frequently stumble on basic numerical understanding. The famous "9.11 > 9.9" failure exemplifies how models misjudge decimal magnitudes[3](https://arxiv.org/html/2604.11582#bib.bib3). This weakness stems largely from tokenization: numbers are fragmented into arbitrary subword units, losing their positional and magnitude information[4](https://arxiv.org/html/2604.11582#bib.bib4),[10](https://arxiv.org/html/2604.11582#bib.bib10). Standard tokenizers split "100400" into "100" and "400" without encoding that the former represents hundreds of thousands. As a result, models must learn magnitude relationships from scratch—a statistically inefficient task.
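This fragmentation can be inspected directly. The short probe below is illustrative only (it is not part of the paper and assumes the optional `tiktoken` package); it simply prints the subword pieces a standard BPE vocabulary assigns to a few numerals, whatever those happen to be for the chosen encoding.

```python
# Illustrative probe (assumes the optional tiktoken package is installed).
# Prints the subword pieces a standard BPE vocabulary assigns to numerals;
# the exact splits depend on the chosen encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["100400", "1234567", "9.11", "9.9"]:
    pieces = [enc.decode([tok]) for tok in enc.encode(s)]
    print(f"{s!r} -> {pieces}")
```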
Recent work has explored various solutions: specialized number tokens[6](https://arxiv.org/html/2604.11582#bib.bib6), continuous number encodings such as xVal[11](https://arxiv.org/html/2604.11582#bib.bib11), right-to-left tokenization with comma separation[5](https://arxiv.org/html/2604.11582#bib.bib5), and comprehensive benchmarks like NumericBench[1](https://arxiv.org/html/2604.11582#bib.bib1) and Number Cookbook[3](https://arxiv.org/html/2604.11582#bib.bib3). Studies comparing base-10 (digit-level) versus base-1000 (triad-level) tokenization show that digit-level approaches are more data-efficient but struggle with magnitude comprehension[5](https://arxiv.org/html/2604.11582#bib.bib5),[4](https://arxiv.org/html/2604.11582#bib.bib4).

This paper introduces a simple yet effective tokenization scheme that combines triad grouping (base-1000) with explicit magnitude annotations for both integer and fractional parts, providing a stronger inductive bias for numerical reasoning.

## 2 Related Work

Digit-level tokenization (base-10) treats each digit as an independent token. While data-efficient, it lacks explicit magnitude cues, forcing models to infer order from positional patterns alone[5](https://arxiv.org/html/2604.11582#bib.bib5). Multi-digit tokenization, used by GPT-3.5 and GPT-4, reduces sequence length but introduces arbitrary boundaries that can fragment numbers inconsistently[4](https://arxiv.org/html/2604.11582#bib.bib4).

xVal encodes numbers as continuous embeddings using a single token per number, demonstrating strong interpolation performance[11](https://arxiv.org/html/2604.11582#bib.bib11). However, it discards exact digits, trading precision for smoothness—a trade-off unsuitable for tasks requiring exact arithmetic.

Right-to-left tokenization, which adds commas every three digits from the right, has been shown to improve integer arithmetic by up to 20% compared to unformatted numbers[4](https://arxiv.org/html/2604.11582#bib.bib4),[5](https://arxiv.org/html/2604.11582#bib.bib5). This demonstrates that even simple formatting shifts help models reason about numbers. However, commas only group digits; they do not indicate the magnitude of each group. A model must still infer whether "123" stands for 123, 123,000, or 123,000,000 from its position in the sequence.

Recent work has also explored modifying the loss function rather than tokenization. Number Token Loss (NTL)[9](https://arxiv.org/html/2604.11582#bib.bib9) replaces the standard cross-entropy loss with a regression-like objective for numerical tokens, treating numbers as continuous values during training. This approach preserves exact digits while providing a smooth gradient signal. Importantly, NTL operates at the training level and is orthogonal to tokenization schemes—it can be combined with any input representation.

Several established benchmarks can assess numerical reasoning in future studies: NumericBench evaluates six fundamental numerical capabilities[1](https://arxiv.org/html/2604.11582#bib.bib1), Number Cookbook covers 17 distinct numerical tasks[3](https://arxiv.org/html/2604.11582#bib.bib3), and studies probing enumeration skills show that even state-of-the-art models lack systematic enumeration abilities[2](https://arxiv.org/html/2604.11582#bib.bib2).

## 3 Proposed Method: Triadic Suffix Tokenization (TST)

### 3.1 Core Principles

1. Group digits into triads (thousands, millions, etc.).
2. Annotate each triad with explicit magnitude markers.
3. Preserve exact digits.

### 3.2 Integer Part

Digits are grouped from right to left:

- 100400 → 100k 400
- 1234567 → 1m 234k 567
- 100200400 → 100m 200k 400
- 123456789012345 → 123t 456b 789m 012k 345

**Table 1: Integer suffixes**

| Suffix | Magnitude | Name |
|--------|-----------|-------------|
| (none) | 10^0 | ones |
| k | 10^3 | thousand |
| m | 10^6 | million |
| b | 10^9 | billion |
| t | 10^12 | trillion |
| q | 10^15 | quadrillion |

#### 3.2.1 Why Suffixes, Not Just Commas

Unlike right-to-left commas, which only group digits, the suffixes k, m, b, t, q explicitly indicate the order of magnitude. This gives the model direct access to the scale of each digit group, rather than forcing it to infer magnitude from position.
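As an illustration, a minimal Python sketch of this integer-part grouping might look as follows; the helper name `tokenize_integer_part` is ours, the tokens are emitted in the compound style, and the leading group is left unpadded, as in the worked examples above.

```python
# Illustrative sketch (not reference code from the paper): integer-part TST
# grouping.  The suffix table follows Table 1 and covers up to quadrillions.
INTEGER_SUFFIXES = ["", "k", "m", "b", "t", "q"]  # 10^0, 10^3, ..., 10^15

def tokenize_integer_part(digits: str) -> list[str]:
    """Group an integer digit string into right-to-left triads with suffixes."""
    groups = []
    # Walk from the right, taking three digits at a time.
    for end in range(len(digits), 0, -3):
        groups.append(digits[max(0, end - 3):end])
    groups.reverse()  # most significant group first
    tokens = []
    for i, group in enumerate(groups):
        order = len(groups) - 1 - i          # 0 = ones, 1 = thousands, ...
        tokens.append(group + INTEGER_SUFFIXES[order])
    return tokens

assert tokenize_integer_part("100400") == ["100k", "400"]
assert tokenize_integer_part("1234567") == ["1m", "234k", "567"]
assert tokenize_integer_part("123456789012345") == ["123t", "456b", "789m", "012k", "345"]
```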
### 3.3 Fractional Part

Fractional digits are grouped left to right with replicated 'p' markers. To ensure a canonical representation and maintain a 1:1 mapping between tokens and numerical values, all fractional triads are right-padded with trailing zeros to a fixed length of three digits. This normalization ensures that numerically equivalent representations (e.g., 0.1 and 0.100) result in the same token sequence.

- 1.12345678 → 1. 123p 456pp 780ppp
- 0.0045 → 0. 004p 500pp
- π → 3. 141p 592pp 653ppp 589pppp 793ppppp

Maximum practical depth is typically 5 'p's (15 decimal places). Without padding, the vocabulary would unnecessarily expand to include sub-triadic variations, and the model would be forced to learn that 0.1, 0.10, and 0.100 represent the same value independently. Padding ensures that each token consistently represents a value in the form n × 10^−3k where n ∈ [0, 999], requiring only 1,000 tokens for each triad depth.

This normalization prioritizes numerical consistency and computational precision over the preservation of the original surface form. While mapping numerically equivalent values to a single sequence improves convergence, some tasks require trailing zeros to carry semantic meaning. A detailed discussion of this trade-off and a proposed minor extension for precision-preserving applications are provided in Section 6.2.
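The fractional-part grouping can be sketched in the same illustrative fashion (again, not reference code from the paper); note the trailing-zero padding of the final triad and the replicated 'p' markers.

```python
# Illustrative sketch (not reference code): fractional-part TST grouping with
# canonical right-padding and replicated 'p' precision markers.  The paper
# notes a practical depth of five levels (15 decimal places).
def tokenize_fractional_part(digits: str) -> list[str]:
    """Group fractional digits left to right into triads, pad the final triad
    with trailing zeros, and attach one 'p' per depth level."""
    tokens = []
    for start in range(0, len(digits), 3):
        group = digits[start:start + 3].ljust(3, "0")  # canonical padding
        depth = start // 3 + 1                          # 1 -> 'p', 2 -> 'pp', ...
        tokens.append(group + "p" * depth)
    return tokens

assert tokenize_fractional_part("12345678") == ["123p", "456pp", "780ppp"]
assert tokenize_fractional_part("0045") == ["004p", "500pp"]
assert tokenize_fractional_part("141592653589793") == [
    "141p", "592pp", "653ppp", "589pppp", "793ppppp"]
```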
### 3.4 Vocabulary Considerations

Two options are possible:

- **Option A (Separate tokens):** Keep digit groups and suffixes as separate tokens. The vocabulary adds only 10 new tokens (k, m, b, t, q, p, pp, ppp, pppp, ppppp). This minimally expands the vocabulary, but the model must learn to associate digit groups with their suffixes.
- **Option B (Compound tokens):** Create combined tokens like "100k", "234m". With triads from 000–999 and 5 integer suffixes: 1000 × 5 = 5000 tokens. For fractions with up to 5 'p' depths and triads 000–999: another 5000 tokens. Total 10,000 new tokens—manageable for modern vocabularies.

Option B produces shorter input sequences (lower token count) and presents the model with ready-made magnitude-digit units, eliminating any ambiguity about which suffix belongs to which digit group. The smaller vocabulary of Option A is appealing, but the trade-off between sequence length and vocabulary size remains an empirical question.

While we focus on N=3 (Triadic) for its alignment with human readability and its effective balance between vocabulary size and sequence length, the framework is fundamentally designed for any group size N, allowing for precise optimization of the model's vocabulary footprint. It is worth noting that Option B produces the exact same number of input tokens as a simple division into groups of three. Even in the extreme case of N=1, the input context length remains no larger than that of standard digit-by-digit representations, as each digit is fused with its magnitude marker into a single atomic token. While Option A is more "token-greedy" because it emits separate markers, it may still be practical for comparatively larger group sizes (N ≥ 3) where the relative overhead becomes negligible.

The complete deterministic procedure for the TST scheme, illustrating the generation of magnitude-aware compound tokens as described in Option B, is formalized in Algorithm 1. For Option A, the algorithm remains identical except in Step 4, where the digit group and the suffix are appended as two separate tokens instead of a single compound unit.

**Algorithm 1: Generalized Suffix Tokenization (GST) Mapping**

Input: A raw numerical string S, and internal group size N optimized for the model's vocabulary.
Output: A sequence of magnitude-aware tokens T

```
// Step 1: Normalization and Cleansing
Prefix ← ExtractPrefix(S)
S' ← NormalizeToStandardDigits(S)
I, F ← SplitIntoIntegerAndFractionalParts(S')

// Step 2: Integer Processing
ℐ ← DivideIntoGroupsOfSizeFromRightToLeft(I, N)
ℐ ← PadWithZerosFromLeftToMultipleOf(ℐ, N)

// Step 3: Canonical Fraction Processing
ℱ ← DivideIntoGroupsOfSizeFromLeftToRight(F, N)
ℱ ← PadWithZerosFromRightToMultipleOf(ℱ, N)

// Step 4: Token Generation with Suffixes
T ← []
if Prefix is not empty then
    T.append(Prefix)
endif

// Process Integer Parts (Option B depicted)
for each group Gₖ in ℐ at magnitude order k do
    suffix ← GetMagnitudeSuffix(k)
    T.append(Gₖ + suffix)
endfor

// Process Fractional Parts
if ℱ is not empty then
    T.append(".")
    for each group Gₚ in ℱ at decimal depth p do
        suffix ← GetPrecisionSuffix(p)
        T.append(Gₚ + suffix)
    endfor
endif

return T
```
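For completeness, the sketch below combines the integer and fractional groupings shown earlier into one illustrative rendering of the Algorithm 1 mapping for N = 3. The names are ours rather than the paper's; `compound=True` corresponds to Option B and `compound=False` to Option A, and the optional left-padding of the leading integer group is omitted to match the worked examples in Section 3.2.

```python
# Illustrative end-to-end sketch (ours, not the paper's reference code) of the
# Algorithm 1 mapping for group size N = 3.
INTEGER_SUFFIXES = ["", "k", "m", "b", "t", "q"]   # magnitude orders 0..5

def tst_tokenize(s: str, compound: bool = True) -> list[str]:
    """Map a raw numerical string to magnitude-aware tokens.

    compound=True  -> Option B ('123k' as a single compound token)
    compound=False -> Option A (digit group and suffix as separate tokens)
    """
    tokens: list[str] = []
    # Step 1: normalization -- extract an optional sign prefix, split on '.'.
    if s and s[0] in "+-":
        tokens.append(s[0])
        s = s[1:]
    integer, _, fraction = s.partition(".")

    def emit(group: str, suffix: str) -> None:
        if compound and suffix:
            tokens.append(group + suffix)        # Option B: fused unit
        else:
            tokens.append(group)
            if suffix and not compound:
                tokens.append(suffix)            # Option A: marker on its own

    # Step 2: integer triads, grouped right to left (leading group unpadded,
    # as in the worked examples of Section 3.2).
    groups = []
    for end in range(len(integer), 0, -3):
        groups.append(integer[max(0, end - 3):end])
    groups.reverse()
    for i, g in enumerate(groups):
        emit(g, INTEGER_SUFFIXES[len(groups) - 1 - i])

    # Steps 3 and 4: canonical fractional triads with replicated 'p' markers.
    if fraction:
        tokens.append(".")
        for start in range(0, len(fraction), 3):
            emit(fraction[start:start + 3].ljust(3, "0"), "p" * (start // 3 + 1))
    return tokens

assert tst_tokenize("100400") == ["100k", "400"]
assert tst_tokenize("1.12345678") == ["1", ".", "123p", "456pp", "780ppp"]
assert tst_tokenize("0.0045", compound=False) == ["0", ".", "004", "p", "500", "pp"]
```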
## 4 Annotation Placement: Prefix vs. Suffix

A related line of work, NumeroLogic[8](https://arxiv.org/html/2604.11582#bib.bib8), adds a prefix indicating the total number of digits in a number (e.g., 2 for 42). On short-number arithmetic (addition of numbers up to three digits), this approach improves accuracy, demonstrating that explicit structural annotations help. However, NumeroLogic provides a single prefix per number and does not annotate individual triads. Its effectiveness on longer numbers or fractional precision tasks remains an open question. TST, in contrast, is designed to scale across 33 orders of magnitude (10^−15 to 10^18) by annotating each triad with its magnitude. Given the success of prefix-based length annotation on short numbers, one might expect similar or stronger benefits from TST's more detailed per-triad annotation across the full numeric range.

Beyond the choice between a single length prefix and per-triad suffixes, there is also the question of where to place the marker within each triad: before the digits (prefix) or after (suffix). For example, the number 123,456 could be represented as:

- **Suffix format:** 123k 456
- **Prefix format:** k123 456

When using compound tokens (Option B: 123k or k123 as a single token), the distinction between prefix and suffix disappears at the token level—both become atomic vocabulary units, and the embedding layer treats them identically regardless of internal order. However, when numbers are tokenized digit by digit, suffix placement offers several advantages:

- **Deterministic boundaries:** The suffix marks the end of a triad. For 1 2 3 k 4, the model knows that "1 2 3" form a complete thousands group before seeing further digits. With a prefix (k 1 2 3 4), the model cannot know where the group ends.
- **Alignment with human reading:** Suffix notation (e.g., "123k") matches the natural way numbers are written with separators (123,000) and spoken ("one hundred twenty-three thousand").
- **Implicit length information:** A suffix like q (quadrillion) tells the model that exactly 15 more digits (five triads) follow, providing a strong prior for the total length of the integer part.

Prefix placement could potentially provide the magnitude information slightly earlier, but it does so at the cost of slight boundary ambiguity. For these reasons, we adopt the **suffix** placement for TST when digit-level tokenization is used. In the compound-token variant (Option B), the choice is immaterial. The suffix versus prefix question remains an empirical one for some cases, but the deterministic boundary property makes suffix a natural and robust default.

## 5 Statistical Analysis

### 5.1 Comparison Framework

The comparison of different approaches is provided in Table 2.

**Table 2: Comparison of tokenization schemes: inductive biases**