Cross-Tokenizer LLM Distillation through a Byte-Level Interface
Summary
This paper proposes Byte-Level Distillation (BLD), a simple method for cross-tokenizer knowledge transfer in language models by operating at a shared byte-level interface, achieving competitive or superior performance compared to more complex existing approaches across 1B-8B parameter models.
Cached at: 04/20/26, 08:29 AM
Source: https://huggingface.co/papers/2604.07466
Abstract
Byte-Level Distillation enables cross-tokenizer knowledge transfer by operating at the byte level, achieving competitive performance compared to complex existing methods.
Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher’s output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
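The core idea in the abstract, mapping a token-level distribution onto bytes and distilling there, can be illustrated with a toy sketch. This is not the paper's implementation: the function names `first_byte_marginal` and `kl_divergence` are hypothetical, the probabilities are made up, and the sketch makes the simplifying assumption that only the first byte of each teacher token matters. The abstract does not say how the conversion handles later bytes within a token or token boundaries, so those parts are left out here.

```python
import math
from collections import defaultdict


def first_byte_marginal(token_probs):
    """Collapse a next-token distribution into a next-byte distribution.

    token_probs maps each token's byte string to its probability.
    Simplifying assumption (not from the paper): only the first byte
    of each token contributes, so this covers a single byte position.
    """
    byte_probs = defaultdict(float)
    for token_bytes, prob in token_probs.items():
        if not token_bytes:
            continue  # ignore empty tokens in this toy example
        byte_probs[token_bytes[0]] += prob
    total = sum(byte_probs.values())
    return {byte: prob / total for byte, prob in byte_probs.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over byte values; q is floored at eps to avoid log(0)."""
    return sum(
        p_b * math.log(p_b / max(q.get(b, 0.0), eps))
        for b, p_b in p.items()
        if p_b > 0.0
    )


# Hypothetical teacher distribution over its own subword vocabulary.
teacher_tokens = {b"the": 0.5, b"to": 0.2, b"a": 0.3}

# Hypothetical output of a student byte-level head (probabilities over bytes).
student_bytes = {ord("t"): 0.6, ord("e"): 0.1, ord("a"): 0.3}

teacher_bytes = first_byte_marginal(teacher_tokens)
loss = kl_divergence(teacher_bytes, student_bytes)
print(teacher_bytes, loss)
```

Because both distributions live over the same 256 byte values, the loss needs no vocabulary alignment step, which is the property the abstract highlights.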
arXiv page: https://arxiv.org/abs/2604.07466
PDF: https://arxiv.org/pdf/2604.07466
Similar Articles
X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation
X-Token introduces two loss formulations (P-KL and H-KL) to address failure modes in logit-based cross-tokenizer knowledge distillation, enabling a student model to learn from teachers with incompatible vocabularies and achieving state-of-the-art results on Llama-3.2-1B.
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
This paper investigates the impact of subword tokenization on LLM training efficiency and performance by conducting controlled byte-level pretraining experiments. It reveals key factors such as training throughput and the integration of subword boundaries as linguistic priors.
Fast Byte Latent Transformer
This paper introduces BLT Diffusion and speculative decoding techniques for byte-level language models to significantly reduce generation latency and memory bandwidth costs while maintaining quality.
@JulieKallini: Fast Byte Latent Transformer is accepted to ICML 2026! Byte-level LMs promise to free us from subword tokenizers, but d…
The Fast Byte Latent Transformer (BLT-D) has been accepted to ICML 2026, introducing a text diffusion method for parallel byte-level decoding to overcome the speed limitations of traditional byte-level language models.
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
This paper introduces BitLM, a language model that uses bitwise continuous diffusion to generate multiple tokens in parallel, aiming to overcome the sequential bottleneck of traditional autoregressive generation while preserving causal structure.