I built a diffusion language model from scratch. It writes flawless sentences that mean nothing, and that is the interesting part.

Reddit r/AI_Agents Models

Summary

The author built Joey, a 170M parameter masked diffusion language model from scratch, trained on FineWeb-Edu and fine-tuned on DailyDialog, achieving fluent but incoherent sentences due to capacity limitations. The project highlights the differences from autoregressive LLMs and the lessons learned from building and debugging the system.

# Most LLMs predict the next token. Joey does not. GPT-style models are autoregressive: they generate left to right, one token at a time, each token conditioned on the ones before it. Joey belongs to a different family, **masked diffusion** (the MDLM / LLaDA line of work). Instead of writing left to right, it: 1. Starts from a sequence that is 100% `[MASK]` 2. Predicts **every** token in parallel 3. Keeps only the tokens it is most confident about 4. Re-masks the rest 5. Repeats until the whole sequence resolves That remasking loop (MaskGIT / LLaDA style) is also what kills the repetition collapse that naive single-pass samplers fall into. In one diagram: FineWeb-Edu ── BPE ──▶ packed token blocks │ mask each token w.p. t (forward process, fixed) │ bidirectional Transformer(+ t) (reverse process, learned) │ 1/t-weighted cross-entropy on masked positions │ (after training) all-[MASK] ──▶ predict · keep confident · re-mask rest ──▶ text (sampling) * **Forward process:** corrupt text by replacing tokens with `[MASK]` at a random rate `t`. * **Reverse process:** a bidirectional, timestep-conditioned Transformer predicts the originals. * **Loss:** cross-entropy on the masked positions only, `1/t`\-weighted (the MDLM objective). # The architecture |Property|Value| |:-|:-| |Parameters|\~170M| |Backbone|Bidirectional Transformer (no causal mask), timestep-conditioned| |`d_model` / layers / heads|1024 / 12 / 16| |Context length|256 tokens| |Vocabulary|16,384 (custom ByteLevel BPE + `[PAD] [BOS] [EOS] [MASK]`)| |MLP|4x GELU, pre-norm, weight-tied head| |Diffusion|Masked / absorbing-state (MDLM / LLaDA family)| Everything is hand-written: the 16K ByteLevel BPE tokenizer, the bidirectional timestep-conditioned Transformer, the diffusion loss, and the iterative-remasking sampler. No `Trainer`, no pretrained weights. Built test-first, with unit tests for every module. # Training |Stage|Details| |:-|:-| |Data|FineWeb-Edu, \~2B tokens, own 16K BPE tokenizer| |Base|A100-40GB, bf16 + EMA, cosine LR + warmup, 174K steps (\~6h), gradient accumulation, hours kill-switch| |Fine-tune|DailyDialog, response-only masking (LLaDA-style SFT)| |Sampler|Remasking (MaskGIT / LLaDA) + repetition penalty + top-p| # The honest result After base training plus conversational fine-tuning, here is an actual exchange: you> Hi! joey> Oh, I am right! It's my favorite, we have always been there for a long time... you> Do you like music? joey> I don't know that much. But I think there is no one... Joey greets correctly, forms grammatical sentences, and holds a conversational register. It is fluent but not yet truly coherent: correct local grammar without sustained global meaning. That is not a bug I gave up on. It is the signature of a capacity ceiling. At 170M parameters the model had essentially converged for its size. It learned *how* language sounds before it had the room to learn *what* to actually say. Getting to genuine coherence is primarily a scale problem (more parameters and tokens), and that is the next milestone. # What actually broke, and what it taught me The two failures I learned the most from: * **CUDA OOM** during training, which forced me to actually understand memory layout, gradient accumulation, and batch packing instead of copying a config. * **Repetition collapse** in sampling, which is where the remasking strategy earns its keep. Naive single-shot decoding loops on itself. Predicting all tokens, keeping only the confident ones, and re-masking the rest breaks the loop. You do not really understand diffusion LLMs until you have debugged your own OOM at 2am and watched a loss curve flatten in front of you. No paper or course gets you there. Building the broken version did. # Roadmap * \[x\] From-scratch tokenizer, model, diffusion loss, sampler, training loop * \[x\] Base pretraining on \~2B tokens + conversational SFT * \[x\] Remasking sampler to eliminate repetition loops * \[ \] Scale up (\~400M to 1B) for real coherence, in progress * \[ \] Larger, cleaner instruction-tuning data * \[ \] Classifier-free guidance for conditional sampling * \[ \] Longer context # Code and weights * Link in comments Built on the shoulders of MDLM (Sahoo et al., 2024), LLaDA (Nie et al., 2025), D3PM (Austin et al., 2021), SEDD (Lou et al., 2024), and MaskGIT (Chang et al., 2022). If you have worked with discrete diffusion for text, I would love to hear how you think about the autoregressive vs diffusion tradeoff, especially whether the parallel-decoding speed wins actually survive at scale.
Original Article

Similar Articles

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Hugging Face Daily Papers

LangFlow presents the first continuous diffusion language model that rivals discrete diffusion approaches, challenging the long-held belief that continuous diffusion is inferior for language modeling. The work introduces key ingredients like optimal Gumbel-based noise scheduling and demonstrates competitive perplexity and transfer learning performance compared to discrete diffusion baselines.