Byte-level models

Reddit r/LocalLLaMA News

Summary

Asks whether byte-level tokenizers outperform subword tokenizers on precise tasks such as distinguishing similar names, counting characters, telling uppercase from lowercase, and avoiding skipped data in summaries, and requests current recommendations.

How helpful are byte-level tokenizers and decoders compared to subword tokenizers for precise tasks today? Do they give genuinely better results at distinguishing small differences between similar names and words (e.g., Jansen vs. Jensen), counting characters, telling uppercase from lowercase letters, or avoiding “skipped” data in summaries? If they do help with fine-grained tasks, which one is the current favorite?
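
For readers unfamiliar with what a byte-level model takes as input, here is a minimal sketch using only the Python standard library. It shows the raw UTF-8 byte values for the names in the question and where they differ. The helper names `byte_ids` and `differing_positions` are made up for illustration and do not come from any tokenizer library. The sketch only illustrates the input representation; it says nothing about whether byte-level models actually score better on these tasks.

```python
def byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values of text, one integer per byte.

    Assumption: a byte-level model uses these values (possibly offset by
    a few special tokens) as its input IDs.
    """
    return list(text.encode("utf-8"))


def differing_positions(a: str, b: str) -> list[int]:
    """Return the byte positions at which a and b differ.

    Positions past the end of the shorter sequence count as differences.
    """
    xs, ys = byte_ids(a), byte_ids(b)
    diffs = [i for i, (x, y) in enumerate(zip(xs, ys)) if x != y]
    diffs.extend(range(min(len(xs), len(ys)), max(len(xs), len(ys))))
    return diffs


print(byte_ids("Jansen"))                       # [74, 97, 110, 115, 101, 110]
print(byte_ids("Jensen"))                       # [74, 101, 110, 115, 101, 110]
print(differing_positions("Jansen", "Jensen"))  # [1]
print(byte_ids("a"), byte_ids("A"))             # [97] [65]
```

At the byte level the two names differ in exactly one position, and uppercase and lowercase letters always map to different values. A subword tokenizer may instead map each name to one or more vocabulary IDs that do not expose the individual letters. Note that byte count equals character count only for ASCII text: `byte_ids("é")` has two entries.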

Similar Articles

Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Hugging Face Daily Papers

This paper proposes Byte-Level Distillation (BLD), a simple method for cross-tokenizer knowledge transfer in language models by operating at a shared byte-level interface, achieving competitive or superior performance compared to more complex existing approaches across 1B-8B parameter models.

Compute Optimal Tokenization (2 minute read)

TLDR AI

This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.

Finding Optimal Tokenizers

Hacker News Top

This blog post presents an algorithm using integer linear programming to compute optimal tokenizers for language models, drawing parallels to solving the Traveling Salesman Problem. It notes that while the result is theoretically interesting, practical tokenizers are already near-optimal and the method may not generalize well.