Compute Optimal Tokenization (2 minute read)

TLDR AI Papers

Summary

This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.

Researchers derived compression-aware neural scaling laws by training nearly 1,300 models, revealing how bytes per token affect compute allocation. The results challenge the heuristic of training on 20 tokens per parameter, showing that the ratio is an artifact of specific tokenizers. The study suggests measuring training data in bytes, not tokens, for better compute efficiency across diverse languages.

Cached at: 05/14/26, 12:10 AM

# Compute Optimal Tokenization

Source: [https://arxiviq.substack.com/p/compute-optimal-tokenization](https://arxiviq.substack.com/p/compute-optimal-tokenization)

**Authors:** *Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer*

**Paper:** [https://arxiv.org/abs/2605.01188v1](https://arxiv.org/abs/2605.01188v1)

**Code:** [https://co-tok.github.io](https://co-tok.github.io/)

**WHAT was done?** The authors systematically derived compression-aware neural scaling laws by training nearly 1,300 models to determine how information granularity (bytes per token) impacts optimal compute allocation.

**WHY it matters?** This work proves that the widely accepted heuristic of scaling models by 20 tokens per parameter is an artifact of specific subword tokenizers. Establishing a tokenizer-agnostic scaling law based on bytes provides a robust framework for maximizing compute efficiency across diverse languages and modalities.

**Executive summary:** For research teams optimizing large-scale pre-training runs, the tokenization scheme is often treated as a static preprocessing step. This paper reframes tokenization as a dynamic scaling variable. By optimizing the “compression rate” (information density), the authors demonstrate that training data should scale proportionally to model parameters in *bytes*, not tokens. Furthermore, they reveal that the optimal compression rate is compute-dependent, requiring lower compression as FLOP budgets scale up, thus offering a new blueprint for training highly efficient, massively multilingual foundation models.
[![](https://substackcdn.com/image/fetch/$s_!FDxH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87babda4-43d1-4389-883a-172a4cbe0fe9_5504x3072.jpeg)](https://substackcdn.com/image/fetch/$s_!FDxH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87babda4-43d1-4389-883a-172a4cbe0fe9_5504x3072.jpeg)

Foundation model scaling is largely governed by established scaling laws, most notably the heuristic derived in [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556) (Chinchilla), which posits an optimal ratio of approximately 20 training tokens per model parameter. However, a critical blind spot in this heuristic is its reliance on a fixed tokenization scheme. Expressing data volume strictly in tokens ignores the variable information density that each token represents, essentially binding fundamental scaling behavior to the arbitrary mechanics of Byte-Pair Encoding (BPE) tokenizers. This study isolates the token as a variable to identify the true invariant in scaling behavior, exposing the extent to which popular tokenizers inherently skew compute allocation.
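To make the token-versus-byte distinction concrete, here is a minimal Python sketch that converts between tokens per parameter and bytes per parameter. Only the 20-tokens-per-parameter heuristic comes from the text above. The function names, the reference compression rate of 4 bytes per token, and the other compression rates are illustrative assumptions, not figures from the paper.

```python
def training_bytes(n_params: float, tokens_per_param: float, bytes_per_token: float) -> float:
    """Total training data in bytes for a given token budget and compression rate."""
    return n_params * tokens_per_param * bytes_per_token


def tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """Tokens per parameter implied by a fixed bytes-per-parameter budget."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return bytes_per_param / bytes_per_token


# ASSUMPTION: a hypothetical reference tokenizer that averages 4 bytes per token.
REFERENCE_BYTES_PER_TOKEN = 4.0
# The 20-tokens-per-parameter heuristic, re-expressed in bytes for that tokenizer.
BYTES_PER_PARAM = 20 * REFERENCE_BYTES_PER_TOKEN

if __name__ == "__main__":
    # ASSUMPTION: illustrative compression rates, from byte-level (1.0) upward.
    for rate in (1.0, 2.0, 4.0, 8.0):
        ratio = tokens_per_param(BYTES_PER_PARAM, rate)
        print(f"{rate:.0f} bytes/token -> {ratio:.0f} tokens per parameter")
```

Under these assumed numbers, holding the byte budget per parameter fixed makes the tokens-per-parameter ratio vary inversely with the compression rate. This is one way to see why a single token-based ratio cannot hold across tokenizers.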

Similar Articles

Finding Optimal Tokenizers

Hacker News Top

This blog post presents an algorithm using integer linear programming to compute optimal tokenizers for language models, drawing parallels to solving the Traveling Salesman Problem. It notes that while the result is theoretically interesting, practical tokenizers are already near-optimal and the method may not generalize well.

Token maxxing

Reddit r/singularity

Discusses strategies and techniques for maximizing token usage in large language models to improve efficiency and output quality.

Stochasticity in Tokenization Improves Robustness

arXiv cs.CL

This paper demonstrates that training large language models with stochastic tokenization instead of deterministic canonical tokenization significantly improves robustness to adversarial attacks and random perturbations, with improvements shown across pre-training, fine-tuning, and in-context learning without increasing inference costs.

Byte-level models

Reddit r/LocalLLaMA

Discusses whether byte-level tokenizers outperform subword tokenizers for precise tasks like distinguishing similar names, counting characters, and case sensitivity, and asks for current recommendations.