Byte-level models
Summary
Discusses whether byte-level tokenizers outperform subword tokenizers on precision-sensitive tasks such as distinguishing similar names, counting characters, and handling case sensitivity, and asks for current recommendations.
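As a rough illustration of why these tasks come up in this discussion, the sketch below shows the input a byte-level model would receive: one ID per UTF-8 byte, so character counts and case differences are directly visible in the sequence. The helper names (`byte_ids`, `count_char`, `first_difference`) are hypothetical and introduced only for this example. It shows the input representation only and says nothing about how well any model performs.

```python
from typing import List, Optional


def byte_ids(text: str) -> List[int]:
    """Return the UTF-8 byte values, i.e. the IDs a byte-level model would see."""
    return list(text.encode("utf-8"))


def count_char(text: str, ch: str) -> int:
    """Count a single-byte (ASCII) character by scanning the byte IDs."""
    target = ch.encode("utf-8")
    if len(target) != 1:
        raise ValueError("this sketch only handles single-byte characters")
    return byte_ids(text).count(target[0])


def first_difference(a: str, b: str) -> Optional[int]:
    """Return the index of the first differing byte, or None if identical."""
    ids_a, ids_b = byte_ids(a), byte_ids(b)
    for i, (x, y) in enumerate(zip(ids_a, ids_b)):
        if x != y:
            return i
    if len(ids_a) != len(ids_b):
        # One sequence is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None


if __name__ == "__main__":
    print(byte_ids("Ab"))                              # [65, 98]
    print(count_char("strawberry", "r"))               # 3
    print(first_difference("Jon Smith", "John Smith")) # 2
    print(first_difference("apple", "Apple"))          # 0
```

A subword tokenizer may instead map a whole name or word to one or two opaque IDs, in which case the individual characters are not explicit in the input. Whether that difference matters in practice is the question the post raises.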
Similar Articles
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
This paper investigates the impact of subword tokenization on LLM training efficiency and performance through controlled byte-level pretraining experiments. It identifies training throughput and the integration of subword boundaries as linguistic priors as key factors behind the benefits of subword tokenization.
Cross-Tokenizer LLM Distillation through a Byte-Level Interface
This paper proposes Byte-Level Distillation (BLD), a simple method for cross-tokenizer knowledge transfer in language models by operating at a shared byte-level interface, achieving competitive or superior performance compared to more complex existing approaches across 1B-8B parameter models.
Compute Optimal Tokenization
This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.
Finding Optimal Tokenizers
This blog post presents an algorithm using integer linear programming to compute optimal tokenizers for language models, drawing parallels to solving the Traveling Salesman Problem. It notes that while the result is theoretically interesting, practical tokenizers are already near-optimal and the method may not generalize well.
Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models
This paper systematically compares equitable tokenizers for multilingual LLMs across 11 Southeast Asian languages, finding that Parity-aware BPE achieves the best efficiency-equity trade-off and that cross-lingual fairness and tokenization efficiency are not fundamentally at odds.