Compute Optimal Tokenization (2 minute read)

TLDR AI Papers

Summary

This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.

Researchers derived compression-aware neural scaling laws by training nearly 1,300 models, revealing how bytes per token affect optimal compute allocation. This challenges the common heuristic of training on 20 tokens per parameter, showing it to be an artifact of specific tokenizers. The study suggests scaling data in bytes, not tokens, for better compute efficiency across diverse languages.
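To make the tokenizer-dependence concrete, here is a minimal sketch (the model size and bytes-per-token figures are hypothetical placeholders, not numbers from the paper) of how the same 20-tokens-per-parameter rule maps to very different volumes of raw text depending on a tokenizer's compression rate:

```python
# Illustrative only: the bytes-per-token values and model size below are
# hypothetical, chosen to show why a token-denominated data budget is
# tokenizer-dependent.

def data_budget_bytes(params: float, tokens_per_param: float, bytes_per_token: float) -> float:
    """Raw text volume (in bytes) implied by a tokens-per-parameter heuristic."""
    return params * tokens_per_param * bytes_per_token

N = 7e9  # hypothetical 7B-parameter model
for name, bpt in [("coarse subword tokenizer, ~4.5 bytes/token", 4.5),
                  ("finer subword tokenizer, ~3.0 bytes/token", 3.0),
                  ("byte-level model, 1.0 bytes/token", 1.0)]:
    gb = data_budget_bytes(N, tokens_per_param=20, bytes_per_token=bpt) / 1e9
    print(f"{name}: 20 tokens/param -> {gb:.0f} GB of raw text")
```

Under these assumptions the "same" budget spans roughly 140 to 630 GB of underlying text, which is exactly the ambiguity a byte-denominated scaling law removes.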

# Compute Optimal Tokenization

Source: [https://arxiviq.substack.com/p/compute-optimal-tokenization](https://arxiviq.substack.com/p/compute-optimal-tokenization)

**Authors:** *Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer*

**Paper:** [https://arxiv.org/abs/2605.01188v1](https://arxiv.org/abs/2605.01188v1)

**Code:** [https://co-tok.github.io](https://co-tok.github.io/)

**WHAT was done?** The authors systematically derived compression-aware neural scaling laws by training nearly 1,300 models to determine how information granularity (bytes per token) impacts optimal compute allocation.

**WHY it matters?** This work proves that the widely accepted heuristic of scaling models by 20 tokens per parameter is an artifact of specific subword tokenizers. Establishing a tokenizer-agnostic scaling law based on bytes provides a robust framework for maximizing compute efficiency across diverse languages and modalities.

**Executive summary:** For research teams optimizing large-scale pre-training runs, the tokenization scheme is often treated as a static preprocessing step. This paper reframes tokenization as a dynamic scaling variable. By optimizing the “compression rate” (information density), the authors demonstrate that training data should scale proportionally to model parameters in *bytes*, not tokens. Furthermore, they reveal that the optimal compression rate is compute-dependent, requiring lower compression as FLOP budgets scale up, thus offering a new blueprint for training highly efficient, massively multilingual foundation models.

[![](https://substackcdn.com/image/fetch/$s_!FDxH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87babda4-43d1-4389-883a-172a4cbe0fe9_5504x3072.jpeg)](https://substackcdn.com/image/fetch/$s_!FDxH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87babda4-43d1-4389-883a-172a4cbe0fe9_5504x3072.jpeg)

Foundation model scaling is largely governed by established scaling laws, most notably the heuristic derived in [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556) (Chinchilla), which posits an optimal ratio of approximately 20 training tokens per model parameter. However, a critical blind spot in this heuristic is its reliance on a fixed tokenization scheme. Expressing data volume strictly in tokens ignores the variable information density that each token represents, essentially binding fundamental scaling behavior to the arbitrary mechanics of Byte-Pair Encoding (BPE) tokenizers. This study isolates the token as a variable to identify the true invariant in scaling behavior, exposing the extent to which popular tokenizers inherently skew compute allocation.
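As a concrete illustration of the accounting involved, the sketch below pairs the standard training-FLOP approximation C ≈ 6ND (the basis of the Chinchilla heuristic) with a bytes-per-token conversion; the FLOP budget and compression rate are hypothetical, and the byte-based coefficients the paper actually fits are not reproduced here.

```python
# Minimal sketch, not the paper's fitted law: allocate a FLOP budget using the
# standard C ~= 6*N*D approximation and the 20-tokens-per-parameter heuristic,
# then re-express the data term in bytes for a given tokenizer.
import math

def chinchilla_split(flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Return (parameters N, training tokens D) with C = 6*N*D and D = tokens_per_param * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

C = 1e23                # hypothetical training FLOP budget
bytes_per_token = 4.0   # hypothetical tokenizer compression rate (bytes/token)

N, D_tokens = chinchilla_split(C)
D_bytes = D_tokens * bytes_per_token
print(f"N ~ {N:.2e} params, D ~ {D_tokens:.2e} tokens ~ {D_bytes:.2e} bytes")

# A byte-denominated law would fix the bytes-per-parameter ratio directly, so the
# prescription stays put when the tokenizer (and hence bytes_per_token) changes.
```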

Similar Articles

Stochasticity in Tokenization Improves Robustness

arXiv cs.CL

This paper demonstrates that training large language models with stochastic tokenization instead of deterministic canonical tokenization significantly improves robustness to adversarial attacks and random perturbations, with improvements shown across pre-training, fine-tuning, and in-context learning without increasing inference costs.

(1D) Ordered Tokens Enable Efficient Test-Time Search

Hugging Face Daily Papers

This paper investigates how 1D coarse-to-fine token structures in autoregressive models improve test-time search efficiency compared to classical 2D grid tokenization. The authors show that such ordered tokens enable better test-time scaling and even training-free text-to-image generation when guided by image-text verifiers.

Scaling laws for neural language models

OpenAI Blog

Foundational empirical study demonstrating power-law scaling relationships between language model performance and model size, dataset size, and compute budget, with implications for optimal training allocation and sample efficiency.

Optimizing Korean-Centric LLMs via Token Pruning

arXiv cs.CL

This paper presents a systematic benchmark of token pruning—a compression technique that removes tokens and embeddings for irrelevant languages—applied to Korean-centric LLM tasks. The study evaluates popular multilingual models (Qwen3, Gemma-3, Llama-3, Aya) across different vocabulary configurations and finds that token pruning significantly improves generation stability and reduces memory footprint for domain-specific deployments.