A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Summary
This paper introduces Triadic Suffix Tokenization (TST), a deterministic tokenization scheme that partitions digits into three-digit triads with explicit magnitude markers to improve numerical reasoning in large language models. The method addresses inconsistent number fragmentation in standard tokenizers by providing transparent order-of-magnitude relationships at the token level, with two implementation variants offering scalable vocabulary expansion.
# A Triadic Suffix Tokenization Scheme for Numerical Reasoning

Source: https://arxiv.org/html/2604.11582

Olga Chetverina, Lecturer at MIPT, Moscow, Russia. Email: [email protected]. ORCID: 0000-0003-4924-8130. Previously archived at Zenodo: DOI 10.5281/zenodo.18999577. This research was conducted independently outside of any institutional assignments, and no institutional resources were used.

## Abstract

Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure—a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude (10^−15 to 10^18); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. While we focus on 3-digit groups (Triadic), the framework is inherently scalable to any group size for precise vocabulary optimization. Furthermore, it allows for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.

## 1 Introduction

Large Language Models (LLMs) have achieved remarkable performance on complex reasoning tasks, yet they frequently stumble on basic numerical understanding. The famous "9.11 > 9.9" failure exemplifies how models misjudge decimal magnitudes[3](https://arxiv.org/html/2604.11582#bib.bib3). This weakness stems largely from tokenization: numbers are fragmented into arbitrary subword units, losing their positional and magnitude information[4](https://arxiv.org/html/2604.11582#bib.bib4),[10](https://arxiv.org/html/2604.11582#bib.bib10). Standard tokenizers split "100400" into "100" and "400" without encoding that the former represents hundreds of thousands. As a result, models must learn magnitude relationships from scratch—a statistically inefficient task.
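This fragmentation can be inspected directly. The short probe below is illustrative only (it is not part of the paper and assumes the optional `tiktoken` package); it simply prints the subword pieces a standard BPE vocabulary assigns to a few numerals, whatever those happen to be for the chosen encoding.

```python
# Illustrative probe (assumes the optional tiktoken package is installed).
# Prints the subword pieces a standard BPE vocabulary assigns to numerals;
# the exact splits depend on the chosen encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["100400", "1234567", "9.11", "9.9"]:
    pieces = [enc.decode([tok]) for tok in enc.encode(s)]
    print(f"{s!r} -> {pieces}")
```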
Recent work has explored various solutions: specialized number tokens[6](https://arxiv.org/html/2604.11582#bib.bib6), continuous number encodings such as xVal[11](https://arxiv.org/html/2604.11582#bib.bib11), right-to-left tokenization with comma separation[5](https://arxiv.org/html/2604.11582#bib.bib5), and comprehensive benchmarks like NumericBench[1](https://arxiv.org/html/2604.11582#bib.bib1) and Number Cookbook[3](https://arxiv.org/html/2604.11582#bib.bib3). Studies comparing base-10 (digit-level) versus base-1000 (triad-level) tokenization show that digit-level approaches are more data-efficient but struggle with magnitude comprehension[5](https://arxiv.org/html/2604.11582#bib.bib5),[4](https://arxiv.org/html/2604.11582#bib.bib4).

This paper introduces a simple yet effective tokenization scheme that combines triad grouping (base-1000) with explicit magnitude annotations for both integer and fractional parts, providing a stronger inductive bias for numerical reasoning.

## 2 Related Work

Digit-level tokenization (base-10) treats each digit as an independent token. While data-efficient, it lacks explicit magnitude cues, forcing models to infer order from positional patterns alone[5](https://arxiv.org/html/2604.11582#bib.bib5). Multi-digit tokenization, used by GPT-3.5 and GPT-4, reduces sequence length but introduces arbitrary boundaries that can fragment numbers inconsistently[4](https://arxiv.org/html/2604.11582#bib.bib4).

xVal encodes numbers as continuous embeddings using a single token per number, demonstrating strong interpolation performance[11](https://arxiv.org/html/2604.11582#bib.bib11). However, it discards exact digits, trading precision for smoothness—a trade-off unsuitable for tasks requiring exact arithmetic.

Right-to-left tokenization, which adds commas every three digits from the right, has been shown to improve integer arithmetic by up to 20% compared to unformatted numbers[4](https://arxiv.org/html/2604.11582#bib.bib4),[5](https://arxiv.org/html/2604.11582#bib.bib5). This demonstrates that even simple formatting shifts help models reason about numbers. However, commas only group digits; they do not indicate the magnitude of each group. A model must still infer whether "123" stands for 123, 123,000, or 123,000,000 from its position in the sequence.

Recent work has also explored modifying the loss function rather than tokenization. Number Token Loss (NTL)[9](https://arxiv.org/html/2604.11582#bib.bib9) replaces the standard cross-entropy loss with a regression-like objective for numerical tokens, treating numbers as continuous values during training. This approach preserves exact digits while providing a smooth gradient signal. Importantly, NTL operates at the training level and is orthogonal to tokenization schemes—it can be combined with any input representation.

Several established benchmarks can assess numerical reasoning in future studies: NumericBench evaluates six fundamental numerical capabilities[1](https://arxiv.org/html/2604.11582#bib.bib1), Number Cookbook covers 17 distinct numerical tasks[3](https://arxiv.org/html/2604.11582#bib.bib3), and studies probing enumeration skills show that even state-of-the-art models lack systematic enumeration abilities[2](https://arxiv.org/html/2604.11582#bib.bib2).

## 3 Proposed Method: Triadic Suffix Tokenization (TST)

### 3.1 Core Principles

1. Group digits into triads (thousands, millions, etc.).
2. Annotate each triad with explicit magnitude markers.
3. Preserve exact digits.

### 3.2 Integer Part

Digits are grouped from right to left:

- 100400 → 100k 400
- 1234567 → 1m 234k 567
- 100200400 → 100m 200k 400
- 123456789012345 → 123t 456b 789m 012k 345

**Table 1: Integer suffixes**

| Suffix | Magnitude | Name |
|--------|-----------|-------------|
| (none) | 10^0 | ones |
| k | 10^3 | thousand |
| m | 10^6 | million |
| b | 10^9 | billion |
| t | 10^12 | trillion |
| q | 10^15 | quadrillion |

#### 3.2.1 Why Suffixes, Not Just Commas

Unlike right-to-left commas, which only group digits, the suffixes k, m, b, t, q explicitly indicate the order of magnitude. This gives the model direct access to the scale of each digit group, rather than forcing it to infer magnitude from position.
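As an illustration, a minimal Python sketch of this integer-part grouping might look as follows; the helper name `tokenize_integer_part` is ours, the tokens are emitted in the compound style, and the leading group is left unpadded, as in the worked examples above.

```python
# Illustrative sketch (not reference code from the paper): integer-part TST
# grouping.  The suffix table follows Table 1 and covers up to quadrillions.
INTEGER_SUFFIXES = ["", "k", "m", "b", "t", "q"]  # 10^0, 10^3, ..., 10^15

def tokenize_integer_part(digits: str) -> list[str]:
    """Group an integer digit string into right-to-left triads with suffixes."""
    groups = []
    # Walk from the right, taking three digits at a time.
    for end in range(len(digits), 0, -3):
        groups.append(digits[max(0, end - 3):end])
    groups.reverse()  # most significant group first
    tokens = []
    for i, group in enumerate(groups):
        order = len(groups) - 1 - i          # 0 = ones, 1 = thousands, ...
        tokens.append(group + INTEGER_SUFFIXES[order])
    return tokens

assert tokenize_integer_part("100400") == ["100k", "400"]
assert tokenize_integer_part("1234567") == ["1m", "234k", "567"]
assert tokenize_integer_part("123456789012345") == ["123t", "456b", "789m", "012k", "345"]
```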
### 3.3 Fractional Part

Fractional digits are grouped left to right with replicated 'p' markers. To ensure a canonical representation and maintain a 1:1 mapping between tokens and numerical values, all fractional triads are right-padded with trailing zeros to a fixed length of three digits. This normalization ensures that numerically equivalent representations (e.g., 0.1 and 0.100) result in the same token sequence.

- 1.12345678 → 1. 123p 456pp 780ppp
- 0.0045 → 0. 004p 500pp
- π → 3. 141p 592pp 653ppp 589pppp 793ppppp

Maximum practical depth is typically 5 'p's (15 decimal places). Without padding, the vocabulary would unnecessarily expand to include sub-triadic variations, and the model would be forced to learn that 0.1, 0.10, and 0.100 represent the same value independently. Padding ensures that each token consistently represents a value in the form n × 10^−3k where n ∈ [0, 999], requiring only 1,000 tokens for each triad depth.

This normalization prioritizes numerical consistency and computational precision over the preservation of the original surface form. While mapping numerically equivalent values to a single sequence improves convergence, some tasks require trailing zeros to carry semantic meaning. A detailed discussion of this trade-off and a proposed minor extension for precision-preserving applications are provided in Section 6.2.
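The fractional-part grouping can be sketched in the same illustrative fashion (again, not reference code from the paper); note the trailing-zero padding of the final triad and the replicated 'p' markers.

```python
# Illustrative sketch (not reference code): fractional-part TST grouping with
# canonical right-padding and replicated 'p' precision markers.  The paper
# notes a practical depth of five levels (15 decimal places).
def tokenize_fractional_part(digits: str) -> list[str]:
    """Group fractional digits left to right into triads, pad the final triad
    with trailing zeros, and attach one 'p' per depth level."""
    tokens = []
    for start in range(0, len(digits), 3):
        group = digits[start:start + 3].ljust(3, "0")  # canonical padding
        depth = start // 3 + 1                          # 1 -> 'p', 2 -> 'pp', ...
        tokens.append(group + "p" * depth)
    return tokens

assert tokenize_fractional_part("12345678") == ["123p", "456pp", "780ppp"]
assert tokenize_fractional_part("0045") == ["004p", "500pp"]
assert tokenize_fractional_part("141592653589793") == [
    "141p", "592pp", "653ppp", "589pppp", "793ppppp"]
```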
### 3.4 Vocabulary Considerations

Two options are possible:

- **Option A (Separate tokens):** Keep digit groups and suffixes as separate tokens. The vocabulary adds only 10 new tokens (k, m, b, t, q, p, pp, ppp, pppp, ppppp). This minimally expands the vocabulary, but the model must learn to associate digit groups with their suffixes.
- **Option B (Compound tokens):** Create combined tokens like "100k", "234m". With triads from 000–999 and 5 integer suffixes: 1000 × 5 = 5000 tokens. For fractions with up to 5 'p' depths and triads 000–999: another 5000 tokens. Total 10,000 new tokens—manageable for modern vocabularies.

Option B produces shorter input sequences (lower token count) and presents the model with ready-made magnitude-digit units, eliminating any ambiguity about which suffix belongs to which digit group. The smaller vocabulary of Option A is appealing, but the trade-off between sequence length and vocabulary size remains an empirical question.

While we focus on N=3 (Triadic) for its alignment with human readability and its effective balance between vocabulary size and sequence length, the framework is fundamentally designed for any group size N, allowing for precise optimization of the model's vocabulary footprint. It is worth noting that Option B produces the exact same number of input tokens as a simple division into groups of three. Even in the extreme case of N=1, the input context length remains no larger than that of standard digit-by-digit representations, as each digit is fused with its magnitude marker into a single atomic token. While Option A is more "token-greedy" because it emits separate markers, it may still be practical for comparatively larger group sizes (N ≥ 3) where the relative overhead becomes negligible.

The complete deterministic procedure for the TST scheme, illustrating the generation of magnitude-aware compound tokens as described in Option B, is formalized in Algorithm 1. For Option A, the algorithm remains identical except in Step 4, where the digit group and the suffix are appended as two separate tokens instead of a single compound unit.

**Algorithm 1: Generalized Suffix Tokenization (GST) Mapping**

Input: A raw numerical string S, and internal group size N optimized for the model's vocabulary.
Output: A sequence of magnitude-aware tokens T

```
// Step 1: Normalization and Cleansing
Prefix ← ExtractPrefix(S)
S' ← NormalizeToStandardDigits(S)
I, F ← SplitIntoIntegerAndFractionalParts(S')

// Step 2: Integer Processing
ℐ ← DivideIntoGroupsOfSizeFromRightToLeft(I, N)
ℐ ← PadWithZerosFromLeftToMultipleOf(ℐ, N)

// Step 3: Canonical Fraction Processing
ℱ ← DivideIntoGroupsOfSizeFromLeftToRight(F, N)
ℱ ← PadWithZerosFromRightToMultipleOf(ℱ, N)

// Step 4: Token Generation with Suffixes
T ← []
if Prefix is not empty then
    T.append(Prefix)
endif

// Process Integer Parts (Option B depicted)
for each group Gₖ in ℐ at magnitude order k do
    suffix ← GetMagnitudeSuffix(k)
    T.append(Gₖ + suffix)
endfor

// Process Fractional Parts
if ℱ is not empty then
    T.append(".")
    for each group Gₚ in ℱ at decimal depth p do
        suffix ← GetPrecisionSuffix(p)
        T.append(Gₚ + suffix)
    endfor
endif

return T
```
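For completeness, the sketch below combines the integer and fractional groupings shown earlier into one illustrative rendering of the Algorithm 1 mapping for N = 3. The names are ours rather than the paper's; `compound=True` corresponds to Option B and `compound=False` to Option A, and the optional left-padding of the leading integer group is omitted to match the worked examples in Section 3.2.

```python
# Illustrative end-to-end sketch (ours, not the paper's reference code) of the
# Algorithm 1 mapping for group size N = 3.
INTEGER_SUFFIXES = ["", "k", "m", "b", "t", "q"]   # magnitude orders 0..5

def tst_tokenize(s: str, compound: bool = True) -> list[str]:
    """Map a raw numerical string to magnitude-aware tokens.

    compound=True  -> Option B ('123k' as a single compound token)
    compound=False -> Option A (digit group and suffix as separate tokens)
    """
    tokens: list[str] = []
    # Step 1: normalization -- extract an optional sign prefix, split on '.'.
    if s and s[0] in "+-":
        tokens.append(s[0])
        s = s[1:]
    integer, _, fraction = s.partition(".")

    def emit(group: str, suffix: str) -> None:
        if compound and suffix:
            tokens.append(group + suffix)        # Option B: fused unit
        else:
            tokens.append(group)
            if suffix and not compound:
                tokens.append(suffix)            # Option A: marker on its own

    # Step 2: integer triads, grouped right to left (leading group unpadded,
    # as in the worked examples of Section 3.2).
    groups = []
    for end in range(len(integer), 0, -3):
        groups.append(integer[max(0, end - 3):end])
    groups.reverse()
    for i, g in enumerate(groups):
        emit(g, INTEGER_SUFFIXES[len(groups) - 1 - i])

    # Steps 3 and 4: canonical fractional triads with replicated 'p' markers.
    if fraction:
        tokens.append(".")
        for start in range(0, len(fraction), 3):
            emit(fraction[start:start + 3].ljust(3, "0"), "p" * (start // 3 + 1))
    return tokens

assert tst_tokenize("100400") == ["100k", "400"]
assert tst_tokenize("1.12345678") == ["1", ".", "123p", "456pp", "780ppp"]
assert tst_tokenize("0.0045", compound=False) == ["0", ".", "004", "p", "500", "pp"]
```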
## 4 Annotation Placement: Prefix vs. Suffix

A related line of work, NumeroLogic[8](https://arxiv.org/html/2604.11582#bib.bib8), adds a prefix indicating the total number of digits in a number (e.g., 2 for 42). On short-number arithmetic (addition of numbers up to three digits), this approach improves accuracy, demonstrating that explicit structural annotations help. However, NumeroLogic provides a single prefix per number and does not annotate individual triads. Its effectiveness on longer numbers or fractional precision tasks remains an open question. TST, in contrast, is designed to scale across 33 orders of magnitude (10^−15 to 10^18) by annotating each triad with its magnitude. Given the success of prefix-based length annotation on short numbers, one might expect similar or stronger benefits from TST's more detailed per-triad annotation across the full numeric range.

Beyond the choice between a single length prefix and per-triad suffixes, there is also the question of where to place the marker within each triad: before the digits (prefix) or after (suffix). For example, the number 123,456 could be represented as:

- **Suffix format:** 123k 456
- **Prefix format:** k123 456

When using compound tokens (Option B: 123k or k123 as a single token), the distinction between prefix and suffix disappears at the token level—both become atomic vocabulary units, and the embedding layer treats them identically regardless of internal order. However, when numbers are tokenized digit by digit, suffix placement offers several advantages:

- **Deterministic boundaries:** The suffix marks the end of a triad. For 1 2 3 k 4, the model knows that "1 2 3" form a complete thousands group before seeing further digits. With a prefix (k 1 2 3 4), the model cannot know where the group ends.
- **Alignment with human reading:** Suffix notation (e.g., "123k") matches the natural way numbers are written with separators (123,000) and spoken ("one hundred twenty-three thousand").
- **Implicit length information:** A suffix like q (quadrillion) tells the model that exactly 15 more digits (five triads) follow, providing a strong prior for the total length of the integer part.

Prefix placement could potentially provide the magnitude information slightly earlier, but it does so at the cost of slight boundary ambiguity. For these reasons, we adopt the **suffix** placement for TST when digit-level tokenization is used. In the compound-token variant (Option B), the choice is immaterial. The suffix versus prefix question remains an empirical one for some cases, but the deterministic boundary property makes suffix a natural and robust default.

## 5 Statistical Analysis

### 5.1 Comparison Framework

The comparison of different approaches is provided in Table 2.

**Table 2: Comparison of tokenization schemes: inductive biases**