Fast Byte Latent Transformer

Hugging Face Daily Papers 05/08/26, 12:00 AM Papers

byte-level-lm diffusion speculative-decoding inference-optimization transformer arxiv

Summary

This paper introduces BLT Diffusion and speculative decoding techniques for byte-level language models to significantly reduce generation latency and memory bandwidth costs while maintaining quality.

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
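The draft-then-verify idea behind BLT-S and BLT-DV can be illustrated with a minimal greedy sketch. This is generic speculative decoding under stated assumptions, not the paper's implementation: `draft_next`, `verify_next`, and the toy `drafter`/`verifier` functions are hypothetical stand-ins for a cheap drafting component and the full model, and the verification loop below stands in for what a real system would score in a single forward pass.

```python
from typing import Callable, List

# Hypothetical interface: maps a byte context to the next byte (greedy).
NextByte = Callable[[List[int]], int]


def draft_and_verify(prefix: List[int], draft_next: NextByte,
                     verify_next: NextByte, k: int) -> List[int]:
    """Run one draft-then-verify round and return the accepted bytes."""
    # 1) Draft k bytes autoregressively with the cheap drafter.
    ctx = list(prefix)
    draft = []
    for _ in range(k):
        b = draft_next(ctx)
        draft.append(b)
        ctx.append(b)

    # 2) Verify the draft against the verifier's greedy choices.
    #    (Assumption: a real system checks all positions in one pass.)
    accepted = []
    ctx = list(prefix)
    for b in draft:
        target = verify_next(ctx)
        if target != b:
            # First mismatch: keep the verifier's byte and stop.
            accepted.append(target)
            return accepted
        accepted.append(b)
        ctx.append(b)
    return accepted


# Toy models: the verifier spells out TARGET; the drafter errs at position 3.
TARGET = b"hello world"


def verifier(ctx: List[int]) -> int:
    return TARGET[len(ctx) % len(TARGET)]


def drafter(ctx: List[int]) -> int:
    return ord("x") if len(ctx) == 3 else verifier(ctx)


print(bytes(draft_and_verify([], drafter, verifier, k=6)))  # b'hell'
```

In this toy run, three drafted bytes are accepted and the fourth is replaced by the verifier's choice, so one verification round yields four bytes instead of one.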

Original Article


Cached at: 05/11/26, 02:42 AM

Paper page - Fast Byte Latent Transformer

Source: https://huggingface.co/papers/2605.08044

Abstract

Byte-level language models overcome slow autoregressive generation through diffusion-based parallel processing and speculative decoding techniques that improve both speed and quality.

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT’s local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
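The claim that parallel generation reduces the number of forward passes can be made concrete with a small back-of-the-envelope calculation. This is a rough sketch only: the function name, the block size, and the number of denoising steps per block are illustrative assumptions, not figures from the paper.

```python
import math


def decoding_steps(n_bytes: int, block_size: int = 1,
                   steps_per_block: int = 1) -> int:
    """Count decoding steps needed to emit n_bytes.

    block_size=1, steps_per_block=1 models byte-by-byte autoregression.
    Larger blocks model parallel generation, where each block may need
    several refinement steps (an assumed simplification).
    """
    return math.ceil(n_bytes / block_size) * steps_per_block


# Illustrative numbers only (not taken from the paper).
autoregressive = decoding_steps(256)
blockwise = decoding_steps(256, block_size=16, steps_per_block=4)
print(autoregressive, blockwise)  # 256 64
```

Under these assumed settings the block-wise schedule needs a quarter of the steps; the actual savings depend on the block size and on how many steps each block takes in practice.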


Get this paper in your agent:

hf papers read 2605.08044

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

@JulieKallini: Fast Byte Latent Transformer is accepted to ICML 2026! Byte-level LMs promise to free us from subword tokenizers, but d…

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation

Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM

Supportive Token Revealing for Fast Diffusion Language Model Decoding
