Compute Optimal Tokenization (2 minute read)
Summary
This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.
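As a rough illustration of why a tokens-per-parameter rule depends on the tokenizer, the sketch below converts a fixed token budget into bytes of training text. It only multiplies by a compression rate; it does not reproduce the paper's fitted scaling law. The function name and the bytes-per-token values are hypothetical and chosen purely for illustration.

```python
def tokens_to_bytes_per_param(tokens_per_param: float, bytes_per_token: float) -> float:
    """Convert a tokens-per-parameter ratio into bytes of text per parameter.

    bytes_per_token is the tokenizer's average compression rate
    (bytes of raw text covered by one token).
    """
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return tokens_per_param * bytes_per_token


# Hypothetical compression rates, NOT measurements from the paper.
hypothetical_tokenizers = {
    "tokenizer_a": 3.0,  # assumed average bytes per token
    "tokenizer_b": 4.5,  # assumed average bytes per token
}

for name, bytes_per_token in hypothetical_tokenizers.items():
    bytes_per_param = tokens_to_bytes_per_param(20, bytes_per_token)
    print(f"{name}: 20 tokens/param = {bytes_per_param} bytes/param")
```

Under these assumed rates, the same 20-tokens-per-parameter budget corresponds to different amounts of raw text, which is the kind of tokenizer dependence a byte-based formulation is meant to remove.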
Similar Articles
Finding Optimal Tokenizers
This blog post presents an algorithm using integer linear programming to compute optimal tokenizers for language models, drawing parallels to solving the Traveling Salesman Problem. It notes that while the result is theoretically interesting, practical tokenizers are already near-optimal and the method may not generalize well.
Token maxxing
Discusses strategies and techniques for maximizing token usage in large language models to improve efficiency and output quality.
Stochasticity in Tokenization Improves Robustness
This paper demonstrates that training large language models with stochastic tokenization instead of deterministic canonical tokenization significantly improves robustness to adversarial attacks and random perturbations, with improvements shown across pre-training, fine-tuning, and in-context learning without increasing inference costs.
Balancing Image Compression and Generation with Bootstrapped Tokenization
Introduces SelfBootTok, a self-bootstrapped tokenization method that separates global and local information, reducing generator computation by ~40% and achieving a new state-of-the-art gFID of 1.56 with only 64 tokens.
Byte-level models
Discusses whether byte-level tokenizers outperform subword tokenizers on tasks that demand character-level precision, such as distinguishing similar names, counting characters, and handling case sensitivity, and asks for current recommendations.