Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Hugging Face Daily Papers

Summary

This paper proposes Byte-Level Distillation (BLD), a simple method for cross-tokenizer knowledge transfer in language models by operating at a shared byte-level interface, achieving competitive or superior performance compared to more complex existing approaches across 1B-8B parameter models.

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
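To make "convert the teacher's output distribution to byte-level probabilities" concrete, the following is a minimal sketch, not the paper's procedure. It assumes a toy next-token distribution given as a dictionary of token strings and probabilities, and it only computes the distribution over the first byte of the next token. The function name `first_byte_distribution` and the toy vocabulary are hypothetical; a full conversion would also have to handle later bytes within a token and token-boundary effects, which the summary does not detail.

```python
from collections import defaultdict


def first_byte_distribution(token_probs):
    """Collapse a next-token distribution into a next-byte distribution.

    Simplifying assumption: each token contributes its full probability
    to the first byte of its UTF-8 encoding.
    """
    byte_probs = defaultdict(float)
    for token, p in token_probs.items():
        data = token.encode("utf-8")
        if not data:
            continue  # an empty token has no first byte
        byte_probs[data[0]] += p
    total = sum(byte_probs.values())
    # Renormalize in case empty tokens were skipped.
    return {b: p / total for b, p in byte_probs.items()}


# Toy teacher distribution over next tokens (hypothetical values).
teacher_next_token = {"the": 0.5, "then": 0.2, "a": 0.2, "an": 0.1}
byte_dist = first_byte_distribution(teacher_next_token)
```

Here the tokens "the" and "then" share the first byte `t`, so their mass is pooled. Any tokenizer can be mapped into this byte space in the same way, which is what makes the byte level usable as a shared interface.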

Cached at: 04/20/26, 08:29 AM

Source: https://huggingface.co/papers/2604.07466

Abstract

Byte-Level Distillation enables cross-tokenizer knowledge transfer by operating at the byte level, achieving competitive performance compared to complex existing methods.

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher’s output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
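The distillation step through the byte-level interface can be illustrated with a small sketch. The abstract does not state the loss, so this assumes a KL divergence from the teacher's byte distribution to the output of the student's byte-level head at a single position. The names `byte_kl` and `softmax` and the example values are hypothetical, and pure Python is used only to keep the example self-contained.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def byte_kl(teacher_probs, student_logits, eps=1e-12):
    """KL(teacher || student) over the 256 byte values at one position.

    Assumption: the student's byte head emits 256 logits per position.
    """
    student_probs = softmax(student_logits)
    return sum(
        t * (math.log(t) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0  # terms with zero teacher mass contribute nothing
    )


# Teacher puts 0.7 on byte 't' and 0.3 on byte 'a' (toy values).
teacher = [0.0] * 256
teacher[ord("t")] = 0.7
teacher[ord("a")] = 0.3

# An untrained student with uniform logits incurs a large loss.
loss_uniform = byte_kl(teacher, [0.0] * 256)

# A student whose logits reproduce the teacher incurs (almost) none.
matched_logits = [math.log(p) if p > 0 else -1e9 for p in teacher]
loss_matched = byte_kl(teacher, matched_logits)
```

Because both distributions live over the same 256 byte values, no vocabulary alignment heuristic is needed to compare them, regardless of which tokenizers the teacher and student use.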

arXiv page: https://arxiv.org/abs/2604.07466
PDF: https://arxiv.org/pdf/2604.07466

Similar Articles

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

arXiv cs.LG

X-Token introduces two loss formulations (P-KL and H-KL) to address failure modes in logit-based cross-tokenizer knowledge distillation, enabling a student model to learn from teachers with incompatible vocabularies and achieving state-of-the-art results on Llama-3.2-1B.

Fast Byte Latent Transformer

Hugging Face Daily Papers

This paper introduces BLT Diffusion and speculative decoding techniques for byte-level language models to significantly reduce generation latency and memory bandwidth costs while maintaining quality.