TextLDM: Language Modeling with Continuous Latent Diffusion
Summary
This paper introduces TextLDM, a method that adapts visual latent diffusion transformers for language modeling by mapping discrete tokens to continuous latents. It reports that this approach, enhanced by representation alignment with a frozen pretrained language model, matches GPT-2 under the same training settings and moves toward a single architecture for visual and text generation.
Cached at: 05/11/26, 07:20 AM
Source: https://huggingface.co/papers/2605.07748
Abstract
TextLDM adapts visual latent diffusion transformers to language modeling by mapping discrete tokens to continuous latents and using representation alignment for improved text generation quality.
Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
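The abstract names two training signals, flow matching in latent space and REPA alignment, without giving their formulas. The sketch below illustrates both in plain Python under stated assumptions: a straight-line (rectified-flow style) path between noise and a data latent, a mean-squared-error velocity objective, and a cosine-similarity alignment term. All function names are hypothetical, plain lists of floats stand in for latent vectors, and nothing here is taken from the TextLDM implementation.

```python
import math


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data latent x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t.

    predict_velocity(x_t, t) is a stand-in for the denoiser (assumption).
    """
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def cosine_alignment_loss(latent_feat, teacher_feat):
    """1 - cosine similarity between a latent feature and a frozen teacher feature.

    Assumed form of an alignment term; the paper's exact loss is not in the abstract.
    """
    dot = sum(a * b for a, b in zip(latent_feat, teacher_feat))
    norm_a = math.sqrt(sum(a * a for a in latent_feat))
    norm_b = math.sqrt(sum(b * b for b in teacher_feat))
    return 1.0 - dot / (norm_a * norm_b)


def euler_sample(predict_velocity, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # velocity is evaluated at the start of each step
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.5, -1.0, 2.0]
    latent = [1.0, 0.0, -1.0]

    # Toy oracle for a single known target: points from x toward `latent`.
    def oracle(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, latent)]

    print("flow matching loss:", flow_matching_loss(oracle, noise, latent, 0.3))
    print("sampled latent:", euler_sample(oracle, noise, num_steps=8))
    print("alignment loss:", cosine_alignment_loss([1.0, 2.0], [2.0, 4.0]))
```

In a real system the oracle would be replaced by a trained network, and the sampled latent would be passed to the VAE decoder to recover tokens; the toy oracle only shows that Euler integration along a straight path lands on the target latent.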
Similar Articles
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
LangFlow presents the first continuous diffusion language model that rivals discrete diffusion approaches, challenging the long-held belief that continuous diffusion is inferior for language modeling. The work introduces key ingredients like optimal Gumbel-based noise scheduling and demonstrates competitive perplexity and transfer learning performance compared to discrete diffusion baselines.
Continuous Latent Diffusion Language Model
Cola DLM is a hierarchical latent diffusion language model that uses text-to-latent mapping and conditional decoding to achieve efficient, non-autoregressive text generation.
Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion
This paper introduces a diffusion language model that treats text as a continuous process over binary bitstreams, using entropy-gated stochastic sampling to close the performance gap with autoregressive models. It achieves state-of-the-art results on LM1B and OWT benchmarks while reducing memory footprint.
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
This paper introduces BitLM, a language model that uses bitwise continuous diffusion to generate multiple tokens in parallel, aiming to overcome the sequential bottleneck of traditional autoregressive generation while preserving causal structure.
Discrete Stochastic Localization for Non-autoregressive Generation
Introduces Discrete Stochastic Localization (DSL), a continuous-state diffusion framework for non-autoregressive text generation that uses unit-sphere token embeddings and a timestep-invariant denoiser, achieving better distributional faithfulness than masked discrete diffusion models on OpenWebText.