Compute Optimal Tokenization (2 minute read)

TLDR AI 05/13/26, 12:00 AM Papers

tokenization scaling-laws compute-optimal neural-networks large-language-models compression efficiency

Summary

This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.

Researchers derived compression-aware neural scaling laws by training nearly 1,300 models, revealing how the number of bytes per token affects optimal compute allocation. The result challenges the heuristic of training on 20 tokens per parameter, showing that the ratio is an artifact of specific tokenizers. The study suggests that training data should be scaled in bytes, not tokens, for better compute efficiency across diverse languages.

Cached at: 05/14/26, 12:10 AM

# Compute Optimal Tokenization

Source: [https://arxiviq.substack.com/p/compute-optimal-tokenization](https://arxiviq.substack.com/p/compute-optimal-tokenization)

**Authors:** *Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer*

**Paper:** [https://arxiv.org/abs/2605.01188v1](https://arxiv.org/abs/2605.01188v1)

**Code:** [https://co-tok.github.io](https://co-tok.github.io/)

**WHAT was done?** The authors systematically derived compression-aware neural scaling laws by training nearly 1,300 models to determine how information granularity (bytes per token) impacts optimal compute allocation.

**WHY it matters?** This work proves that the widely accepted heuristic of scaling models by 20 tokens per parameter is an artifact of specific subword tokenizers. Establishing a tokenizer-agnostic scaling law based on bytes provides a robust framework for maximizing compute efficiency across diverse languages and modalities.

**Executive summary:** For research teams optimizing large-scale pre-training runs, the tokenization scheme is often treated as a static preprocessing step. This paper reframes tokenization as a dynamic scaling variable. By optimizing the “compression rate” (information density), the authors demonstrate that training data should scale proportionally to model parameters in *bytes*, not tokens. Furthermore, they reveal that the optimal compression rate is compute-dependent, requiring lower compression as FLOP budgets scale up, thus offering a new blueprint for training highly efficient, massively multilingual foundation models.
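To make the "bytes per token" compression rate concrete, here is a minimal Python sketch that measures it for two toy tokenizers. The helper names (`bytes_per_token`, `byte_tokenize`, `whitespace_tokenize`) and the sample sentence are hypothetical illustrations, not taken from the paper or its code, and real subword tokenizers such as BPE would land somewhere between these two extremes.

```python
def bytes_per_token(text, tokenize):
    """Average number of UTF-8 bytes of `text` covered by each token."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return len(text.encode("utf-8")) / len(tokens)


def byte_tokenize(text):
    """Toy byte-level tokenizer: one token per UTF-8 byte (no compression)."""
    return list(text.encode("utf-8"))


def whitespace_tokenize(text):
    """Toy word-level tokenizer: one token per whitespace-separated word."""
    return text.split()


# Hypothetical sample text, chosen only for illustration.
sample = "scaling laws depend on tokenization"

byte_rate = bytes_per_token(sample, byte_tokenize)
word_rate = bytes_per_token(sample, whitespace_tokenize)

print(f"byte-level: {byte_rate:.1f} bytes/token")
print(f"word-level: {word_rate:.1f} bytes/token")
```

The same text therefore counts as a very different number of tokens depending on the tokenizer, which is why a data budget stated in tokens is ambiguous while one stated in bytes is not.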
[![](https://substackcdn.com/image/fetch/$s_!FDxH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87babda4-43d1-4389-883a-172a4cbe0fe9_5504x3072.jpeg)](https://substackcdn.com/image/fetch/$s_!FDxH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87babda4-43d1-4389-883a-172a4cbe0fe9_5504x3072.jpeg)

Foundation model scaling is largely governed by established scaling laws, most notably the heuristic derived in [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556) (Chinchilla), which posits an optimal ratio of approximately 20 training tokens per model parameter. However, a critical blind spot in this heuristic is its reliance on a fixed tokenization scheme. Expressing data volume strictly in tokens ignores the variable information density that each token represents, essentially binding fundamental scaling behavior to the arbitrary mechanics of Byte-Pair Encoding (BPE) tokenizers. This study isolates the token as a variable to identify the true invariant in scaling behavior, exposing the extent to which popular tokenizers inherently skew compute allocation.
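The following Python sketch shows how a token-based heuristic shifts once the tokenizer changes, assuming the invariant is a fixed byte budget per parameter. Both constants are illustrative assumptions: the value of 4 bytes per token is a hypothetical compression rate, not a figure from the paper, and the function names are invented for this example.

```python
# Illustrative assumptions only; neither constant is a result from the paper.
ASSUMED_BYTES_PER_TOKEN = 4.0       # hypothetical compression rate of a subword tokenizer
TOKENS_PER_PARAM_HEURISTIC = 20.0   # the 20-tokens-per-parameter rule of thumb


def bytes_per_param(tokens_per_param, bytes_per_token):
    """Convert a token-based data ratio into a byte-based one."""
    return tokens_per_param * bytes_per_token


def tokens_per_param(byte_budget_per_param, bytes_per_token):
    """Convert a byte-based data ratio back into tokens for a given tokenizer."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return byte_budget_per_param / bytes_per_token


# Byte budget implied by the heuristic under the assumed compression rate.
BYTE_BUDGET = bytes_per_param(TOKENS_PER_PARAM_HEURISTIC, ASSUMED_BYTES_PER_TOKEN)

# Holding the byte budget fixed, the tokens-per-parameter ratio moves with the tokenizer.
for rate in (1.0, 2.0, 4.0, 8.0):
    ratio = tokens_per_param(BYTE_BUDGET, rate)
    print(f"{rate:.1f} bytes/token -> {ratio:.1f} tokens/param")
```

Under these assumptions, the familiar ratio of 20 appears only at the one compression rate used to define the budget; any other tokenizer yields a different token count for the same amount of raw data.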

Similar Articles

Stochasticity in Tokenization Improves Robustness

(1D) Ordered Tokens Enable Efficient Test-Time Search

From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

Scaling laws for neural language models

Optimizing Korean-Centric LLMs via Token Pruning
