TextLDM: Language Modeling with Continuous Latent Diffusion

Hugging Face Daily Papers

Summary

This paper introduces TextLDM, which adapts the visual latent diffusion transformer recipe to language modeling by mapping discrete tokens to continuous latents. With representation alignment against a frozen pretrained language model, it outperforms prior diffusion language models and matches GPT-2 under the same training settings, a step toward a single architecture for visual and text generation.

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
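
The alignment idea described above can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not give the exact REPA objective used for text, so this assumes a common form of representation alignment, namely the mean negative cosine similarity between linearly projected VAE features and the features of a frozen teacher model. The names `project`, `alignment_loss`, and the projection matrix `W` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(W, h):
    """Linear projection W @ h, with W given as a list of rows.

    Assumption: a single linear map stands in for whatever projection
    head maps VAE features into the teacher's feature space.
    """
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def alignment_loss(latent_feats, teacher_feats, W):
    """Mean negative cosine similarity over token positions.

    latent_feats:  per-token features from the VAE (the trainable side).
    teacher_feats: per-token features from a frozen language model,
                   treated as fixed targets.
    """
    sims = [
        cosine_similarity(project(W, h), t)
        for h, t in zip(latent_feats, teacher_feats)
    ]
    return -sum(sims) / len(sims)


# Toy usage: two token positions, 2-dimensional features, identity projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
latents = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[2.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(latents, teacher, identity)
```

The loss reaches its minimum when every projected latent feature points in the same direction as the corresponding teacher feature, regardless of scale. In a full training setup such a term would presumably be added to the VAE's reconstruction objective with some weight, but that weighting is an assumption and is not stated in the abstract.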

Cached at: 05/11/26, 07:20 AM


Source: https://huggingface.co/papers/2605.07748

Abstract

TextLDM adapts visual latent diffusion transformers to language modeling by mapping discrete tokens to continuous latents and using representation alignment for improved text generation quality.

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
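
For readers unfamiliar with flow matching in a latent space, the following is a minimal sketch of the training target and a sampler. It rests on assumptions the abstract does not confirm: a straight-line path between Gaussian noise and the data latent (the rectified-flow style choice), a mean-squared-error regression on velocity, and fixed-step Euler integration at sampling time. The `model` argument is a hypothetical stand-in for the DiT and can be any callable taking a latent and a time.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data latent x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t: d x_t / dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(model, x1, rng):
    """One-sample estimate of the velocity regression loss for latent x1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    xt = interpolate(x0, x1, t)
    pred = model(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x1)


def euler_sample(model, x0, steps):
    """Integrate dx/dt = model(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = model(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Toy usage: with a single data latent, the exact velocity field is known
# in closed form, so an "oracle" model can stand in for a trained network.
data_latent = [0.5, -1.0, 2.0]


def oracle(x, t):
    return [(b - a) / (1.0 - t) for a, b in zip(x, data_latent)]


rng = random.Random(0)
loss = flow_matching_loss(oracle, data_latent, rng)
sample = euler_sample(oracle, [rng.gauss(0.0, 1.0) for _ in data_latent], steps=8)
```

In the pipeline the abstract describes, the latent would come from the Transformer-based VAE encoder, and the sampled latent would be passed to the VAE decoder to recover tokens. Both steps are omitted here.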


Similar Articles

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Hugging Face Daily Papers

LangFlow presents the first continuous diffusion language model that rivals discrete diffusion approaches, challenging the long-held belief that continuous diffusion is inferior for language modeling. The work introduces key ingredients like optimal Gumbel-based noise scheduling and demonstrates competitive perplexity and transfer learning performance compared to discrete diffusion baselines.

Continuous Latent Diffusion Language Model

Hugging Face Daily Papers

Cola DLM is a hierarchical latent diffusion language model that uses text-to-latent mapping and conditional decoding to achieve efficient, non-autoregressive text generation.

Discrete Stochastic Localization for Non-autoregressive Generation

arXiv cs.LG

Introduces Discrete Stochastic Localization (DSL), a continuous-state diffusion framework for non-autoregressive text generation that uses unit-sphere token embeddings and a timestep-invariant denoiser, achieving better distributional faithfulness than masked discrete diffusion models on OpenWebText.