Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Hugging Face Daily Papers 04/13/26, 12:00 AM Papers

Summary

This paper proposes Byte-Level Distillation (BLD), a simple method for cross-tokenizer knowledge transfer in language models by operating at a shared byte-level interface, achieving competitive or superior performance compared to more complex existing approaches across 1B-8B parameter models.

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD), which enables CTD by operating at a common interface across tokenizers: the byte level. Specifically, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer. They also show that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
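To make the idea of a shared byte-level interface concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not detail the conversion procedure, so this example assumes the simplest possible case, where a teacher's next-token distribution is marginalized onto the first byte of each candidate token and compared against a byte distribution from the student. The toy vocabulary, the probabilities, and the function names `first_byte_distribution` and `kl_divergence` are all hypothetical.

```python
import math
from collections import defaultdict


def first_byte_distribution(token_probs, vocab):
    """Marginalize a next-token distribution onto the first byte of each token.

    token_probs: dict mapping token id -> probability
    vocab: dict mapping token id -> the token's byte string
    Returns a dict mapping byte value (0-255) -> probability.
    """
    byte_probs = defaultdict(float)
    for token_id, prob in token_probs.items():
        token_bytes = vocab[token_id]
        if not token_bytes:
            continue  # skip tokens with no byte content (e.g. special tokens)
        byte_probs[token_bytes[0]] += prob
    total = sum(byte_probs.values())
    # Renormalize in case empty tokens were dropped.
    return {b: p / total for b, p in byte_probs.items()}


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) over byte values, with a floor to avoid log(0)."""
    return sum(
        p * math.log(p / max(student.get(b, 0.0), eps))
        for b, p in teacher.items()
        if p > 0
    )


# Hypothetical teacher vocabulary and next-token distribution.
teacher_vocab = {0: b"the", 1: b"to", 2: b"a", 3: b" an"}
teacher_token_probs = {0: 0.5, 1: 0.2, 2: 0.2, 3: 0.1}

# Teacher probabilities expressed over bytes instead of tokens.
teacher_bytes = first_byte_distribution(teacher_token_probs, teacher_vocab)

# Hypothetical output of a student's byte-level head at the same position.
student_bytes = {ord("t"): 0.6, ord("a"): 0.3, ord(" "): 0.1}

# Distillation signal: how far the student's byte distribution is from the teacher's.
loss = kl_divergence(teacher_bytes, student_bytes)
print(f"byte-level KL: {loss:.4f}")
```

Because both distributions live over the same 256 byte values, no vocabulary alignment is needed to compare them, which is the property the abstract highlights. A full method would also have to handle bytes beyond the first one of each token; this sketch leaves that out.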

Cached at: 04/20/26, 08:29 AM

Source: https://huggingface.co/papers/2604.07466

Abstract

Byte-Level Distillation enables cross-tokenizer knowledge transfer by operating at the byte level, achieving competitive performance compared to complex existing methods.

arXiv: https://arxiv.org/abs/2604.07466

PDF: https://arxiv.org/pdf/2604.07466

Similar Articles

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Fast Byte Latent Transformer

@JulieKallini: Fast Byte Latent Transformer is accepted to ICML 2026! Byte-level LMs promise to free us from subword tokenizers, but d…

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
