Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion

arXiv cs.CL Papers

Summary

This paper introduces a diffusion language model that treats text as a continuous process over binary bitstreams, using entropy-gated stochastic sampling to close the performance gap with autoregressive models. It achieves state-of-the-art results on the LM1B and OWT benchmarks while reducing the memory footprint.

arXiv:2605.07013v1 Announce Type: new

# Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion
Source: [https://arxiv.org/html/2605.07013](https://arxiv.org/html/2605.07013)
###### Abstract

Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting continuous state spaces are highly effective for language. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. Our approach represents semantic tokens as analog bit sequences and utilizes a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On the One Billion Word Benchmark (LM1B), our 130M-parameter bitstream model reaches a generative perplexity ($\mathrm{Gen.PPL}$) of $59.76$ at matched real-data entropy ($4.31$) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving $\mathrm{Gen.PPL}=27.06$ at an entropy of $5.26$ using $4\times$ fewer steps than previous 1024-NFE baselines. As an additional architectural benefit, bitstream diffusion removes the $\mathcal{O}(V)$ vocabulary scaling bottleneck shared by standard DLMs. By predicting $\mathcal{O}(\log V)$ bitwise logits via semantic bit-patching, our model yields a reduced memory footprint and higher throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.

Affiliations: ¹Department of Engineering, University of Cambridge. ²Devotion AI Labs Ltd. ³Donders Institute for Brain, Cognition, and Behaviour, Radboud University.

## 1 Introduction

Autoregressive language models dominate modern text generation because they define a simple factorization and scale reliably. Their main algorithmic limitation is equally clear: generation is inherently sequential. Continuous diffusion models have become a standard framework for high-dimensional image and video generation (Ho et al., [2020](https://arxiv.org/html/2605.07013#bib.bib1); Song et al., [2021](https://arxiv.org/html/2605.07013#bib.bib2); Karras et al., [2022a](https://arxiv.org/html/2605.07013#bib.bib3)). Diffusion language models (DLMs) offer a compelling alternative compute profile, enabling parallel refinement of all positions, arbitrary infilling, and a tunable compute–quality tradeoff. However, unlocking these advantages has proven difficult. For language, diffusion models have historically suffered from a persistent quality–diversity gap compared to autoregressive baselines: they often yield weak sample quality, or achieve artificially low generative perplexity (GenPPL) only by over-generating safe, frequent tokens, thereby collapsing sample entropy.

A common explanation for this gap has been that language naturally demands discrete diffusion. This view has been challenged by recent continuous flow and diffusion language models over one-hot token embeddings (Roos et al., [2026](https://arxiv.org/html/2605.07013#bib.bib16); Lee et al., [2026](https://arxiv.org/html/2605.07013#bib.bib13); Chen et al., [2026](https://arxiv.org/html/2605.07013#bib.bib14)). These methods demonstrate that continuous models can closely rival strong discrete baselines, suggesting the primary bottleneck is not continuity itself, but the interaction between the state representation, objective, and sampler design.

This paper pushes the continuous diffusion paradigm further towards closing the gap to autoregression and establishing a new state-of-the-art DLM Pareto frontier. Instead of diffusing over token embeddings, we diffuse over *bitstreams*, following the approach introduced in (Chen et al., [2023](https://arxiv.org/html/2605.07013#bib.bib6)) for multimodal generation. A sequence of $T$ semantic tokens is encoded as a sequence of fixed-width binary bits, embedded in continuous space, and an EDM-style denoiser is trained to recover the bits from Gaussian corruption. Because the posterior of an isolated bit under Gaussian corruption has a known closed form, we introduce a *matched-filter residual parameterization*: the network analytically computes the independent-bit posterior and focuses its capacity entirely on predicting the contextual residual.

**The critical role of stochastic sampling.** While the bitstream representation provides a strong foundation, our largest empirical gain comes from the sampler. The deterministic probability-flow sampler is already competitive with recent continuous DLMs, but it is over-contractive: it can obtain good GenPPL by undershooting real-data token entropy. We show that EDM-style stochastic churn corrects this behavior. When applied on an entropy-rate sampling grid, full-band churn improves the GenPPL–entropy frontier without changing the trained model or increasing the NFE budget. Intuitively, the entropy-rate grid concentrates solver resolution where bit uncertainty is resolved, and the same grid also makes the effective stochastic correction strongest in that region. We formalize this continuous-time interpretation in [Appendix D](https://arxiv.org/html/2605.07013#A4).

**Empirical status.** As shown in [Figure 2](https://arxiv.org/html/2605.07013#S4.F2), full-band stochasticity is the primary driver of our improved GenPPL–entropy Pareto frontier. On LM1B, our deterministic 256-NFE sampler achieves $\mathrm{Gen.PPL}=82.90$ at entropy $4.30$, remaining competitive with prior continuous DLMs. Applying full-band stochasticity shifts the frontier to $\mathrm{Gen.PPL}=\mathbf{59.76}$ at matched real-data entropy $4.31$. This improves over LangFlow's reported $\mathrm{Gen.PPL}=92.24$ by more than 30 points and reaches the autoregressive quality regime, outperforming the AR baseline's $\mathrm{Gen.PPL}=66.70$ while preserving healthy token entropy. On OpenWebText, our method similarly shifts the high-quality region of the continuous-DLM frontier: the deterministic sampler obtains $\mathrm{Gen.PPL}=46.32$ at entropy $5.13$, while full-band stochasticity improves this to $\mathrm{Gen.PPL}=\mathbf{27.06}$ at entropy $5.26$. This outperforms LangFlow's $\mathrm{Gen.PPL}=36.53$ at slightly lower entropy $5.25$, using 256 rather than 1024 NFEs.

**Additional computational benefits.** As an important structural benefit, bitstream diffusion eliminates the vocabulary-sized output bottleneck shared by almost all DLMs. Simplex, one-hot, and discrete-transition models fundamentally require $\mathcal{O}(V)$ output parameterizations per token. By patching $m=\lceil\log_2 V\rceil$ bits into a single sequence element, our sequence diffusion transformer (SDT) preserves the semantic context length $T$ while replacing the dense vocabulary classifier with a compact $\mathcal{O}(\log V)$ bitwise head. At LM1B scale, this end-to-end simplification yields a $1.6\times$ reduction in peak memory and $2.3\times$ higher training throughput compared to a matched token-space model. Crucially, as context length and vocabulary size grow to OpenWebText scale, these practical gains expand to a $3.3\times$ training throughput speedup and a $19\times$ reduction in generation memory footprint. This demonstrates that the architectural advantage becomes strictly more beneficial as sequences and vocabularies grow.

## 2 Related Work

**Discrete diffusion language models.** Discrete DLMs define Markov corruption processes directly on tokens or masks. D3PM (Austin et al., [2021](https://arxiv.org/html/2605.07013#bib.bib5)) introduced structured discrete denoising diffusion, and SEDD (Lou et al., [2023](https://arxiv.org/html/2605.07013#bib.bib8)) framed discrete diffusion through ratio estimation and score entropy. MDLM (Sahoo et al., [2024](https://arxiv.org/html/2605.07013#bib.bib9)) showed that masked diffusion language modeling can be substantially strengthened with a simplified objective and improved training recipe. Duo (Sahoo et al., [2025](https://arxiv.org/html/2605.07013#bib.bib10)) further connected uniform-state discrete diffusion to Gaussian diffusion and improved training and sampling. These methods are strong baselines for non-autoregressive language modeling, but they remain tied to categorical transition kernels and often require separate designs for masking, absorbing states, or uniform corruption. Our approach instead uses a continuous Gaussian process in bit space, while still decoding to valid discrete tokens at the end.

**Continuous diffusion for categorical data.** Several works have argued that continuous diffusion can be useful for discrete variables when the geometry is handled carefully. Analog Bits (Chen et al., [2023](https://arxiv.org/html/2605.07013#bib.bib6)) represents discrete variables as binary bits and trains continuous diffusion models on analog versions of those bits; it also introduced self-conditioning and asymmetric time intervals, both of which influence our design. CDCD (Dieleman et al., [2022](https://arxiv.org/html/2605.07013#bib.bib7)) models categorical data with continuous-time and continuous-state diffusion, and emphasizes that categorical geometry should inform the parameterization and objective. Riemannian Diffusion Language Models (RDLM) model categorical distributions using statistical-manifold geometry (Jo and Hwang, [2025](https://arxiv.org/html/2605.07013#bib.bib11)). CANDI (Pynadath et al., [2025](https://arxiv.org/html/2605.07013#bib.bib12)) explores hybrid discrete–continuous diffusion. Our contribution is closest in representation to Analog Bits but differs in scale, architecture, language-focused evaluation, matched-filter residual parameterization, entropy-rate scheduling, and entropy-band stochastic sampling. Recently, FLM/FMLM (Lee et al., [2026](https://arxiv.org/html/2605.07013#bib.bib13)) and LangFlow (Chen et al., [2026](https://arxiv.org/html/2605.07013#bib.bib14)) provide the strongest recent evidence that continuous language models can rival discrete DLMs. FLM performs continuous denoising over one-hot token encodings and trains with cross-entropy objectives; FMLM distills the flow map for few-step generation. LangFlow connects embedding-space DLMs to flow matching via Bregman divergence, introduces an ODE-based NLL bound, proposes an information-uniform noise-scheduling principle, and shows that self-conditioning improves continuous DLMs. Our method validates the same broad thesis, namely that continuous generative modeling can work well for language, but takes a different route. We use fixed-width bitstreams instead of one-hot token embeddings, binary score matching instead of token cross-entropy as the default objective, and an analytic matched-filter decomposition of bit posterior logits.

## 3 Method

A text example is tokenized into $T$ semantic tokens and encoded as a fixed-width binary sequence

$$x_0 \in \{0,1\}^S, \qquad S = Tm,$$

where $m$ is the number of bits per token or code token. For LM1B we use $T=128$ and $m=15$; for OpenWebText we use a fixed 1024-code-token representation with $m=16$. The model never generates autoregressively: it initializes an analog bit vector from Gaussian noise and denoises all positions in parallel. [Figure 1](https://arxiv.org/html/2605.07013#S3.F1) gives an end-to-end schematic of the architecture.
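For concreteness, the sketch below shows one way to implement this fixed-width encoding and its inverse in PyTorch; the helper names (`tokens_to_bits`, `bits_to_tokens`) are ours, not the released code.

```python
import torch

def tokens_to_bits(token_ids: torch.Tensor, m: int) -> torch.Tensor:
    """(B, T) integer IDs with values < 2**m -> (B, T*m) bits in {0., 1.}."""
    B, T = token_ids.shape
    shifts = torch.arange(m, device=token_ids.device)   # bit positions 0..m-1
    bits = (token_ids.unsqueeze(-1) >> shifts) & 1      # (B, T, m), LSB first
    return bits.reshape(B, T * m).float()

def bits_to_tokens(bits: torch.Tensor, m: int) -> torch.Tensor:
    """Inverse mapping: threshold analog bits at 0.5 and repack into IDs."""
    B, S = bits.shape
    hard = (bits > 0.5).long().reshape(B, S // m, m)
    weights = 2 ** torch.arange(m, device=bits.device)
    return (hard * weights).sum(-1)                     # (B, T) token IDs

# Round trip at LM1B settings (T = 128 tokens, m = 15 bits per token):
ids = torch.randint(0, 2**15, (2, 128))
assert torch.equal(bits_to_tokens(tokens_to_bits(ids, 15), 15), ids)
```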

### 3.1 Architecture schematic

[Figure 1 schematic: clean bits $x_0\in\{0,1\}^{B\times Tm}$ undergo Gaussian corruption to diffused bits $x_\sigma\in\mathbb{R}^{B\times Tm}$; a linear patch adapter maps $m$-bit groups ($m=15$ per token) to $\mathbb{R}^{B\times T\times d}$; a 12-block Sequence Diffusion Transformer trunk processes the semantic length $T$ with time ($\sigma$) and positional embeddings; the head sums the analytic matched filter $\operatorname{clip}((x_\sigma-1/2)/\sigma^2)$ and an optimal-skip MLP residual $r_\theta$ fed by a local noisy-bits skip connection, yielding total logits $\ell_\theta\in\mathbb{R}^{B\times Tm}$.]

Figure 1: **End-to-end Bitstream Diffusion architecture.** Text is encoded into a $Tm$-length bitstream and corrupted via Gaussian noise. The diffused bits $x_\sigma$ are patched into semantic groups and processed by a length-$T$ Transformer trunk. In parallel with the trunk, a skip connection routes the raw noisy bits directly to the head, where a contextual residual is combined with the analytic matched filter to produce bit-level logits.

As forward process, we use a variance-exploding Gaussian corruption model

$$x_\sigma = x_0 + \sigma\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_S), \qquad \sigma \in [\sigma_{\min}, \sigma_{\max}]. \tag{1}$$

Bits are represented as $0/1$ values, with data center $c = 1/2$ and $\sigma_{\mathrm{data}} = 1/2$. The denoiser predicts bitwise clean probabilities

$$D_\theta(x_\sigma, \sigma) \in (0, 1)^S.$$

These probabilities induce a continuous score estimate through the Gaussian posterior-mean identity

$$s_\theta(x_\sigma, \sigma) = \frac{D_\theta(x_\sigma, \sigma) - x_\sigma}{\sigma^2}. \tag{2}$$

Thus the same output defines discrete bit probabilities and provides the score used by the continuous sampler.

### 3.2 Matched-filter residual parameterization

A central modeling choice is to separate local Gaussian bit denoising from contextual language modeling. For an isolated bit with uniform prior $x_0 \sim \mathrm{Bern}(1/2)$ and observation $x = x_0 + \sigma\epsilon$, the posterior logit is available analytically:

$$\ell_{\mathrm{ind}}(x,\sigma) = \log\frac{p(x_0=1\mid x,\sigma)}{p(x_0=0\mid x,\sigma)} = \frac{x-\frac{1}{2}}{\sigma^2}. \tag{3}$$

Language modeling, however, requires contextual dependencies between bits, tokens, and distant positions. We therefore ask the network to predict only a contextual residual logit $r_\theta$, and add it to the analytic matched filter:

$$\ell_\theta(x_\sigma,\sigma) = r_\theta(x_\sigma,\sigma,x_{\mathrm{sc}}) + \operatorname{clip}\!\left(\frac{x_\sigma-\frac{1}{2}}{\sigma^2}, -C, C\right), \qquad D_\theta = \operatorname{sigmoid}(\ell_\theta). \tag{4}$$

Here $x_{\mathrm{sc}}$ is the self-conditioning input and $C=30$ in our runs. The analytic term handles local Gaussian bit denoising, while the network focuses on contextual dependencies between bits and tokens. This is the strongest training-side component in [Section 4.4](https://arxiv.org/html/2605.07013#S4.SS4): removing it worsens LM1B GenPPL by over 22 points under a fixed deterministic sampler. A derivation of [Equation 3](https://arxiv.org/html/2605.07013#S3.E3) is given in [Section A.1](https://arxiv.org/html/2605.07013#A1.SS1).
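A minimal sketch of this head, assuming logits of shape $(B, S)$ and per-example noise levels of shape $(B, 1)$; `residual_net` stands in for the skip-MLP head described in Section 3.6 and Appendix B.2:

```python
import torch

def matched_filter_logit(x_sigma: torch.Tensor, sigma: torch.Tensor,
                         clip_value: float = 30.0) -> torch.Tensor:
    """Closed-form posterior logit (x - 1/2) / sigma^2 of an isolated bit
    under a uniform Bernoulli prior (Eq. 3), clipped for numerical stability."""
    return torch.clamp((x_sigma - 0.5) / sigma**2, -clip_value, clip_value)

def bit_posteriors(residual_net, x_sigma, sigma, x_sc):
    """Total logit = contextual residual + analytic matched filter (Eq. 4)."""
    logits = residual_net(x_sigma, sigma, x_sc) + matched_filter_logit(x_sigma, sigma)
    return torch.sigmoid(logits)    # D_theta in (0, 1)^S
```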

### 3.3 Training objective and self-conditioning

We train with binary score interpolation, implemented as a weighted denoising MSE between posterior probabilities and clean bits:

$$\mathcal{L}_{\mathrm{SM}}(\theta) = \mathbb{E}_{x_0,\epsilon,\sigma}\left[w(\sigma)\,\frac{1}{S}\sum_{i=1}^{S}\left(D_{\theta,i}(x_\sigma,\sigma) - x_{0,i}\right)^2\right], \tag{5}$$

with EDM weighting

$$w(\sigma) = \frac{\sigma^2 + \sigma_{\mathrm{data}}^2}{\sigma^2\,\sigma_{\mathrm{data}}^2}. \tag{6}$$

Recent continuous token-space language models such as FLM and LangFlow use token-level objectives based on cross-entropy or related divergences (Lee et al., [2026](https://arxiv.org/html/2605.07013#bib.bib13); Chen et al., [2026](https://arxiv.org/html/2605.07013#bib.bib14)). In our binary setting, score matching is natural because $D_\theta$ directly defines the continuous score in [Equation 2](https://arxiv.org/html/2605.07013#S3.E2).
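As a sketch under the same shape conventions (clean bits $x_0$ of shape $(B, S)$, noise levels $\sigma$ of shape $(B, 1)$), the objective can be written as:

```python
import torch

SIGMA_DATA = 0.5

def edm_weight(sigma: torch.Tensor) -> torch.Tensor:
    """EDM weighting w(sigma), Eq. (6)."""
    return (sigma**2 + SIGMA_DATA**2) / (sigma**2 * SIGMA_DATA**2)

def score_matching_loss(denoiser, x0: torch.Tensor, sigma: torch.Tensor,
                        x_sc: torch.Tensor) -> torch.Tensor:
    """Weighted denoising MSE between posterior bit probabilities and clean
    bits, Eq. (5)."""
    x_sigma = x0 + sigma * torch.randn_like(x0)      # VE corruption, Eq. (1)
    D = denoiser(x_sigma, sigma, x_sc)               # posterior probabilities
    per_example = ((D - x0) ** 2).mean(dim=1)        # (1/S) sum_i (...)^2
    return (edm_weight(sigma).squeeze(1) * per_example).mean()
```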

We use self-conditioning by default. With probability $p_{\mathrm{sc}}=0.5$ during training, we perform an auxiliary no-gradient denoising pass and feed the detached posterior probabilities back into the main denoiser as $x_{\mathrm{sc}}$. At sampling time, we use carry-mode self-conditioning, where the previous denoised prediction is fed into the next denoising step. This follows prior continuous bitstream and language diffusion practice (Chen et al., [2023](https://arxiv.org/html/2605.07013#bib.bib6), [2026](https://arxiv.org/html/2605.07013#bib.bib14)).
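A sketch of the corresponding training step, with the auxiliary no-gradient pass run on the same corrupted input as the main pass (reusing `edm_weight` from the previous sketch):

```python
import torch

def training_step(denoiser, x0, sigma, p_sc: float = 0.5):
    x_sigma = x0 + sigma * torch.randn_like(x0)      # one corruption, both passes
    x_sc = torch.zeros_like(x0)                      # empty conditioning input
    if torch.rand(()) < p_sc:                        # self-condition half the time
        with torch.no_grad():                        # auxiliary pass, no gradients
            x_sc = denoiser(x_sigma, sigma, x_sc).detach()
    D = denoiser(x_sigma, sigma, x_sc)               # main denoising pass
    return (edm_weight(sigma).squeeze(1) * ((D - x0) ** 2).mean(dim=1)).mean()
```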

### 3.4 Entropy-rate adaptive noise allocation

The standard EDM log-normal noise distribution is a strong generic choice, but bitstream data can have information profiles that depend on the tokenizer, code representation, redundancy, and dataset. We therefore adapt the information-guided noise allocation principle of Stancevic et al. ([2025](https://arxiv.org/html/2605.07013#bib.bib18)) and Raya et al. ([2026](https://arxiv.org/html/2605.07013#bib.bib17)). Let

$$h_{\log}(\sigma) := \frac{d}{d\log\sigma}\,H(x_0 \mid x_\sigma)$$

denote the conditional entropy rate per unit log-noise. This quantity identifies where the forward process destroys information about the clean bitstream, or equivalently where reverse denoising must resolve the most information.

In practice, we estimate this profile online from the unweighted denoising errors already computed during training, obtaining a proxy $\widehat{h}_{\log}(\sigma)$. This defines a normalized schedule density over $u = \log\sigma$,

$$\pi_\alpha(u) \propto g(e^u; c, n)\,\widehat{h}_{\log}(e^u)^\alpha, \qquad g(\sigma; c, n) = \frac{\sigma^n}{\sigma^n + c^n}. \tag{7}$$

In the final runs we use the sqrt-rate setting $\alpha = 1/2$, with $c = 0.1$, $n = 3$. Training begins from the EDM log-normal distribution, then transitions to the entropy-rate distribution after warmup.

### 3.5 Sampling: deterministic flow and stochastic correction

The deterministic sampler integrates the probability-flow ODE induced by [Equation 2](https://arxiv.org/html/2605.07013#S3.E2). Writing $D_\theta(x, \sigma)$ for the denoised bit probabilities,

$$\frac{dx}{d\sigma} = -\sigma\, s_\theta(x,\sigma) = \frac{x - D_\theta(x,\sigma)}{\sigma}. \tag{8}$$

We use DDIM-style sampling in the main experiments and support Heun correction in the codebase. A key component of our method is the use of entropic time warping, introduced in (Dieleman et al., [2022](https://arxiv.org/html/2605.07013#bib.bib7)) for softmax models and generalized in (Stancevic et al., [2025](https://arxiv.org/html/2605.07013#bib.bib18)) to arbitrary continuous diffusion models. Unless stated otherwise, reported samples use the entropy-rate grid and carry-mode self-conditioning.
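A minimal Euler (DDIM-style) integrator for Eq. (8) with carry-mode self-conditioning, assuming the denoiser signature from the earlier sketches and a decreasing 1-D grid `sigmas` (e.g., built from the entropy-rate CDF):

```python
import torch

@torch.no_grad()
def sample_deterministic(denoiser, sigmas: torch.Tensor, shape):
    x = sigmas[0] * torch.randn(shape)               # start from Gaussian noise
    x_sc = torch.zeros(shape)                        # carry-mode self-conditioning
    for i in range(len(sigmas) - 1):
        sig = sigmas[i] * torch.ones(shape[0], 1)    # per-example sigma label
        D = denoiser(x, sig, x_sc)                   # posterior bit probabilities
        x = x + (sigmas[i + 1] - sigmas[i]) * (x - D) / sigmas[i]   # Eq. (8)
        x_sc = D                                     # carry prediction forward
    return x
```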

#### Entropy-gated stochastic churn.

We adopt an EDM-style stochastic churn (Karras et al., [2022b](https://arxiv.org/html/2605.07013#bib.bib19)). Before a deterministic step from $\sigma_i$ to $\sigma_{i+1}$, the sampler temporarily moves to a noisier level

$$\hat{\sigma}_i = (1 + \gamma_i)\,\sigma_i,$$

injects the corresponding Gaussian noise, and then applies the deterministic update from $(\hat{x}_i, \hat{\sigma}_i)$ to $\sigma_{i+1}$. For the main stochastic results, we use full-band raw churn, $\gamma_i = S_{\mathrm{churn}}/N$, with $S_{\mathrm{noise}} = 1.003$. The key point is that raw churn and effective continuous-time stochasticity are not the same quantity. In the small-step regime, the churn update behaves like a probability-flow step plus a local Langevin correction with effective reverse-SDE strength

$$\lambda_i \approx \frac{\gamma_i\,\sigma_i}{\sigma_i - \sigma_{i+1}}. \tag{9}$$

Thus the sigma grid determines where a fixed raw churn budget becomes a large stochastic correction. For a grid uniform in the CDF of $\pi_\alpha(u)$, the local spacing satisfies

$$\sigma_i - \sigma_{i+1} \approx \frac{\sigma_i}{N\,\pi_\alpha(\log\sigma_i)}.$$

Substituting $\gamma_i = S_{\mathrm{churn}}/N$ into [Equation 9](https://arxiv.org/html/2605.07013#S3.E9) gives

$$\lambda_{\mathrm{ent}}(\sigma_i) \approx S_{\mathrm{churn}}\,\pi_\alpha(\log\sigma_i). \tag{10}$$

The entropy-rate grid therefore has a dual role: it allocates deterministic solver points to information-active noise levels and automatically converts constant full-band raw churn into an information-adaptive Langevin correction. This is the mechanism behind the title phrase "entropy-gated." The full derivation, finite-step caveat, and implementation details are given in [Appendix D](https://arxiv.org/html/2605.07013#A4). We also support the asymmetric time-interval label shift of Analog Bits (Chen et al., [2023](https://arxiv.org/html/2605.07013#bib.bib6)), which evaluates the denoiser at a slightly noisier time label and can help in some low-NFE deterministic regimes. Since its effect is weaker once stochastic churn is enabled, all main 256-NFE stochastic results use $\eta = 0$; implementation details and ablations are given in [Sections B.3](https://arxiv.org/html/2605.07013#A2.SS3) and [E](https://arxiv.org/html/2605.07013#A5).
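A sketch of one churn step under the same conventions; on the entropy-rate grid with constant $\gamma_i = S_{\mathrm{churn}}/N$, this is exactly the update whose effective Langevin strength follows Eq. (10):

```python
import torch

@torch.no_grad()
def churn_step(denoiser, x, x_sc, sigma_i, sigma_next, gamma, s_noise=1.003):
    sigma_hat = (1 + gamma) * sigma_i                # temporarily noisier level
    # Inject exactly the noise needed to move the state from sigma_i to sigma_hat.
    x_hat = x + ((sigma_hat**2 - sigma_i**2) ** 0.5) * s_noise * torch.randn_like(x)
    D = denoiser(x_hat, sigma_hat * torch.ones(x.shape[0], 1), x_sc)
    x_next = x_hat + (sigma_next - sigma_hat) * (x_hat - D) / sigma_hat  # Eq. (8)
    return x_next, D                                 # D is carried as the next x_sc
```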

### 3.6 Sequence diffusion transformer and vocabulary-boundary scaling

The denoiser is a sequence diffusion transformer operating at semantic length $T$, not bit length $Tm$. The $m$ analog bits of each token are grouped into one patch and projected to a token embedding. The trunk matches the scale of recent continuous DLM baselines: 12 Transformer blocks, hidden width 768, 12 attention heads, feed-forward width 3072, RoPE, AdaLN-zero time conditioning, SwiGLU activations, FlashAttention/SDPA kernels when available, dropout $0.1$, and BF16 training. The output head expands each trunk token back to its $m$ bits. Each bit receives both global context from the Transformer patch token and a local skip path carrying the current noisy bit and self-conditioning features. The final bit logit is the sum of the contextual residual and the analytic matched-filter logit in [Equation 4](https://arxiv.org/html/2605.07013#S3.E4). A schematic and full adapter details are given in [Figure 1](https://arxiv.org/html/2605.07013#S3.F1) and [Section B.2](https://arxiv.org/html/2605.07013#A2.SS2). This representation removes the vocabulary-sized output boundary used by most DLMs. Simplex, one-hot, embedding-space, and discrete token-state DLMs typically require an $\mathcal{O}(TV)$ output/loss interface. By contrast, Bitstream Diffusion predicts $m = \lceil\log_2 V\rceil$ bit logits per semantic token, replacing this boundary with $\mathcal{O}(T\log V)$ logits while preserving trunk length $T$. For LM1B, $V = 30{,}522$ and $m = 15$, giving a raw output-logit reduction of $V/m \approx 2035\times$. For OpenWebText, $V = 65{,}536$ and $m = 16$, and this reduction grows to $V/m \approx 4096\times$. We measure the practical end-to-end memory and throughput impact in [Section 4.3](https://arxiv.org/html/2605.07013#S4.SS3), with structural comparisons in [Section A.2](https://arxiv.org/html/2605.07013#A1.SS2).
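The raw boundary reductions quoted above follow directly from these constants; a quick check:

```python
import math

for name, V, m, T in [("LM1B", 30_522, 15, 128), ("OWT", 65_536, 16, 1024)]:
    assert m == math.ceil(math.log2(V))              # m = ceil(log2 V)
    token_logits, bit_logits = T * V, T * m          # per-example output boundary
    print(f"{name}: V/m = {V / m:.0f}x fewer output logits per token "
          f"({token_logits:,} -> {bit_logits:,} per sequence)")
# LM1B: V/m = 2035x fewer output logits per token (3,906,816 -> 1,920 per sequence)
# OWT: V/m = 4096x fewer output logits per token (67,108,864 -> 16,384 per sequence)
```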

## 4 Experiments

We evaluate Bitstream Diffusion on LM1B and OpenWebText (OWT). Because standard DLMs can artificially reduce generative perplexity (GenPPL) by collapsing toward safe, frequent tokens, we evaluate GenPPL jointly with token-frequency entropy. This joint view is critical: we use GenPPL–entropy frontiers to robustly compare configurations and ensure improvements are not artifacts of vocabulary collapse, as suggested in (Pynadath et al., [2026](https://arxiv.org/html/2605.07013#bib.bib15)).

### 4.1 Experimental Setup

**Datasets.** We evaluate on LM1B packed into 128-token blocks ($m=15$, `bert-base-uncased`, 1M steps) and OWT using 1024-token blocks (750K steps). For OWT, we retain GPT-2 tokenization for evaluation but train through a reversible `gpt2id_bpe16` codec that maps GPT-2 ID sequences to dense 16-bit code tokens; all metrics are computed only after inverse decoding to GPT-2 IDs/text. Both datasets use a 12-layer SDT trunk (width 768) trained with AdamW, binary score matching, EDM loss weighting, and an entropy-rate noise schedule; details are in [Appendices B](https://arxiv.org/html/2605.07013#A2) and [A.3](https://arxiv.org/html/2605.07013#A1.SS3).

**Metrics & Sampling.** GenPPL is computed with an external `gpt2-large` on 1024 generated samples. We also report entropy: mean per-sample token unigram entropy in nats, computed on evaluation token IDs (`bert-base-uncased` for LM1B and inverse-decoded GPT-2 IDs for OWT), ensuring comparability with GPT-2-tokenized baselines rather than our intermediate code vocabulary. We sample with either the deterministic probability-flow ODE or full-band stochastic EDM churn ($S_{\mathrm{noise}}=1.003$); unless stated otherwise, results use 256 NFEs on the entropy-rate grid.

### 4.2 Main results

Table 1: **Main single-point comparison.** We report one representative operating point per external method and compare both generative perplexity and token-frequency entropy. For Bitstream Diffusion, we show both deterministic and full-band stochastic 256-NFE samplers using the same trained checkpoint. To ensure robustness, our results are reported as the mean $\pm$ standard deviation across 10 random sampling seeds. The deterministic row isolates the base probability-flow sampler, while the stochastic row shows the effect of adding full-band churn on the entropy-rate grid at the same NFE budget. The real-data row provides the target values for each dataset. On LM1B, the stochastic sampler matches the real-data entropy regime while substantially improving over reported diffusion baselines. On OWT, it improves over LangFlow at matched or slightly higher reported entropy using four times fewer NFEs; additional points on the stochastic sampler frontier are shown in [Figure 2](https://arxiv.org/html/2605.07013#S4.F2).

![Refer to caption](https://arxiv.org/html/2605.07013v1/x1.png) (a) LM1B.
![Refer to caption](https://arxiv.org/html/2605.07013v1/x2.png) (b) OWT.

Figure 2: **GenPPL–entropy comparison.** Generative perplexity (lower is better) versus token-frequency entropy on LM1B and OWT. The deterministic curve isolates our base probability-flow sampler. The highlighted stochastic trajectory represents our optimized Pareto frontier: for each NFE budget, we swept the stochastic churn parameter across eight values and selected the operating point that gives a good balance between generative perplexity and entropy. On both datasets, Bitstream Diffusion establishes a strictly superior operating frontier. On LM1B, our 256-NFE sampler decisively outperforms all prior diffusion baselines while operating exactly in the real-data entropy regime. On OWT, the stochastic frontier consistently dominates previous models, achieving better generative perplexity and entropy than prior 1024-NFE baselines using only a fraction of the compute.

[Table 1](https://arxiv.org/html/2605.07013#S4.T1) and [Figure 2](https://arxiv.org/html/2605.07013#S4.F2) summarize the main comparison. To keep the table focused, we report one representative operating point per external method. For Bitstream Diffusion, we report both deterministic and stochastic 256-NFE samplers using the same trained checkpoint and architecture. The deterministic row isolates the base probability-flow sampler, while the stochastic row shows the effect of adding full-band churn on the entropy-rate grid at the same NFE budget. This distinction is central to the result: the deterministic sampler is already competitive, but it is consistently more contractive, whereas stochastic churn on the entropy-rate grid improves the GenPPL–entropy trade-off without changing the model or increasing NFE.

**LM1B.** As shown in [Table 1](https://arxiv.org/html/2605.07013#S4.T1) and [Figure 2](https://arxiv.org/html/2605.07013#S4.F2), the deterministic 256-NFE sampler is already competitive, but adding full-band stochasticity fundamentally shifts the Pareto frontier. Our stochastic sampler obtains GenPPL $59.76$ at entropy $4.31$, exactly matching the real-data entropy regime. This substantially outperforms all prior diffusion baselines reported (e.g., improving $>30$ GenPPL points over LangFlow) and successfully reaches the text quality regime of the autoregressive (AR) baseline ($\mathrm{Gen.PPL}=66.70$) without sacrificing token diversity.

**OpenWebText.** On OWT, the deterministic 256-NFE sampler is strong in GenPPL but visibly under-entropic relative to the real data ($5.44$), indicating over-contraction. Adding full-band stochasticity corrects this: our selected 256-NFE sampler obtains GenPPL $27.06$ at entropy $5.26$, directly improving over LangFlow's 1024-NFE reported point using four times fewer denoiser evaluations. Increasing the churn budget exposes a useful quality–diversity frontier ([Figure 2](https://arxiv.org/html/2605.07013#S4.F2)), reaching $\mathrm{Gen.PPL}=34.35$ at entropy $5.32$, matching AR baseline quality while pushing closer to real-data entropy. Uncurated generated samples for both datasets are provided in [Appendix G](https://arxiv.org/html/2605.07013#A7).

### 4.3 Computational and memory efficiency

Standard continuous token-space diffusion models require a vocabulary-wide output boundary, forming dense $B \times T \times V$ logits at every denoising evaluation. As context length $T$ and vocabulary size $V$ grow, this becomes a substantial memory and throughput bottleneck. Bitstream Diffusion replaces this boundary with a patched bitstream head that forms only $B \times T \times \lceil\log_2 V\rceil$ logits, while preserving the semantic sequence length $T$. To isolate this systems effect, we profile matched token-space and bitstream models using synthetic batches, removing dataloader, tokenization, disk-I/O, decoding, and host-device transfer effects. Both models use the same 12-layer Sequence Diffusion Transformer trunk with width $d=768$, BF16 mixed precision, and identical profiling settings. Full protocols and batch-size sweeps are given in [Appendix F](https://arxiv.org/html/2605.07013#A6).

Table 2: **End-to-end systems impact of the bitstream boundary.** We compare matched SDT trunks using either a vocabulary-wide token boundary or the patched bitstream boundary. For each dataset and mode, we report metrics at the largest common batch size that fits both models on the profiling GPU. For training, the batch column denotes the per-step optimization batch size; for generation, it denotes the generation batch. Generation throughput is measured in completed semantic tokens per second and includes the full 128-NFE denoising trajectory.

[Table 2](https://arxiv.org/html/2605.07013#S4.T2) shows that the compact bitstream boundary produces substantial end-to-end gains. During training, the shared Transformer trunk, optimizer state, and backward activations dilute the raw output-logit reduction, but the practical gains remain large: Bitstream Diffusion reduces peak memory by $1.64\times$ and improves throughput by $2.31\times$ on LM1B, and by $2.46\times$ and $3.29\times$ on OWT. This has a direct practical consequence: at larger contexts and vocabularies, the token-space baseline reaches the single-GPU memory limit at much smaller per-device batches, forcing either smaller effective batches or additional GPUs to distribute the same workload. By contrast, the bitstream boundary remains feasible at substantially larger per-device batches, improving hardware utilization and reducing the need for memory-driven parallelization.

The advantage is larger during generation, where no optimizer state or backward activations are stored, making the output boundary a much larger fraction of the active memory footprint. At LM1B scale, Bitstream Diffusion reduces generation memory from $52.55$ GiB to $5.25$ GiB at batch size 1024, while doubling throughput. At OpenWebText scale, where the token boundary has $T=1024$ and $V=65{,}536$, Bitstream Diffusion reduces generation memory by $19.25\times$ and improves throughput by $2.65\times$ at the largest common generation batch, allowing substantially larger batches before reaching the same hardware memory limit.

### 4.4 Ablations

[Table 3](https://arxiv.org/html/2605.07013#S4.T3) isolates the main training-side choices on LM1B. Self-conditioning is enabled in all variants (Chen et al., [2023](https://arxiv.org/html/2605.07013#bib.bib6), [2026](https://arxiv.org/html/2605.07013#bib.bib14)). All variants use the same 256-NFE deterministic sampler and the same 256-NFE stochastic churn sweep, so differences reflect the learned model rather than per-variant sampler tuning.

Table 3: **Training ablations on LM1B.** All variants use 256 NFEs. Stochastic results are aggregated over 10 sampling seeds and the same target-churn sweep $\gamma\in[0.11, 0.20]$. Entropy standard deviations are omitted because they are negligible ($<0.01$). Config A is the default protocol used in the main experiments.

The matched-filter residual is the clearest training-side component. Removing the analytic independent-bit posterior substantially worsens both deterministic and stochastic GenPPL, confirming that the closed-form local Gaussian denoiser is a useful inductive bias: the network need not relearn isolated-bit denoising and can instead focus on contextual dependencies.

The objective comparison also supports binary score matching. Replacing score matching with cross-entropy worsens the deterministic sampler and, under stochastic sampling, reaches lower GenPPL only by undershooting real-data entropy more strongly. This is undesirable for our evaluation setting, where GenPPL must be interpreted jointly with token-frequency entropy.

The schedule comparison is a quality–diversity trade-off rather than a clear win for either schedule. EDM gives lower GenPPL, but consistently operates in a more contractive, lower-entropy regime. The entropy-rate schedule gives strong GenPPL while matching the real-data entropy under stochastic sampling. We therefore use it as the default because it is both safer empirically and data-driven: it adapts noise allocation to the measured bitstream information profile across datasets, tokenizers, encodings, or redundant representations.

#### Why not discrete diffusion on bits?

We also tested a SEDD-style discrete diffusion baseline (Lou et al., [2024](https://arxiv.org/html/2605.07013#bib.bib20)) directly on LM1B bitstreams, preserving the same $\mathcal{O}(\log V)$ boundary. It substantially underperforms the matched token-level SEDD baseline: at 256 NFEs, GenPPL is $285.09$ versus $126.28$, at comparable entropy ([Appendix C](https://arxiv.org/html/2605.07013#A3)). This suggests that the compact bit boundary alone is not enough to make bit-level discrete diffusion competitive in this setting.

#### Churn controls the quality–diversity frontier.

[Figure 3](https://arxiv.org/html/2605.07013#S4.F3) isolates stochastic churn on OWT at fixed NFE $=256$ and $\eta=0$. The deterministic sampler is over-contractive: it undershoots real-data entropy and gives worse GenPPL than the best stochastic settings. Moderate full-band churn improves both GenPPL and entropy, moving samples into a healthier quality–diversity regime. Increasing churn further continues to raise entropy, but eventually worsens GenPPL. Thus $S_{\mathrm{churn}}$ is best understood as a test-time frontier knob: moderate stochasticity corrects premature contraction, while excessive stochasticity injects too much uncertainty.

![Refer to caption](https://arxiv.org/html/2605.07013v1/x3.png) (a) Entropy-rate profile.
![Refer to caption](https://arxiv.org/html/2605.07013v1/figures/sampling/owt_churn_sweep_nfe256_eta0_fullband.png) (b) Churn sweep at fixed 256 NFEs.

Figure 3: **Stochastic churn controls the OWT quality–diversity frontier.** Left: Normalized entropy-rate profile over noise level $\sigma$; the shaded region marks the broad entropy-active region used for this diagnostic. Right: At fixed NFE $=256$ and $\eta=0$, moderate full-band churn improves both GenPPL and entropy relative to deterministic sampling, while excessive churn increases entropy at the cost of GenPPL.
#### Sampler-design ablations.

Additional ablations in [Appendix E](https://arxiv.org/html/2605.07013#A5) support the default sampler choices. The entropy-rate grid improves deterministic sampling relative to the Karras grid and makes stochastic churn substantially more effective ([Figure 4](https://arxiv.org/html/2605.07013#A5.F4)). This indicates that stochasticity is not a generic benefit by itself: it is most useful when the sampling grid aligns both solver resolution and stochastic correction with the information-active part of the trajectory. The continuous-time interpretation in [Appendix D](https://arxiv.org/html/2605.07013#A4) explains this behavior, since the effective Langevin strength depends on local grid spacing as well as the raw churn parameter. The stochasticity-window ablation further shows that the correction should be broad rather than hand-localized ([Figure 5](https://arxiv.org/html/2605.07013#A5.F5)). Narrow entropy-CDF windows are sensitive to their location, whereas broad windows are consistently stronger; the full entropy-supported range $q\in[0,1]$ gives the best GenPPL in this sweep, with $q\in[0.1, 0.9]$ close behind. Thus the useful stochastic correction appears to act over a broad information-active region, rather than through a single narrowly localized entropy band.

## 5 Discussion

This work demonstrates that continuous diffusion over semantic bitstreams can substantially narrow the gap between diffusion language models and autoregressive generation while also improving computational scaling properties. Our results suggest that the key challenge for continuous language diffusion is not continuity itself, but the interaction between representation, parameterization, and sampling. By replacing token-space diffusion with fixed-width semantic bitstreams, the model benefits from both structured inductive biases and a more favorable $O(\log V)$ output scaling.

A central finding is the importance of entropy-gated stochastic sampling. Deterministic probability-flow sampling is already competitive, but tends to be over-contractive and under-entropic. Adding stochastic churn on the entropy-rate grid consistently improves the quality–diversity trade-off, suggesting that stochasticity is most effective when concentrated in information-active regions of the reverse process.

The bitstream formulation may become especially valuable for future multilingual and multimodal models. Large unified vocabularies spanning many languages, image tokens, audio units, or multimodal semantic codes can become prohibitively expensive for standard diffusion models with $O(V)$ output heads. In contrast, bitstream diffusion scales only logarithmically with vocabulary size, potentially enabling much larger and more flexible vocabularies without the same output-boundary bottleneck. This suggests a promising direction toward unified generative architectures operating across heterogeneous modalities and languages.

## 6 Limitations

The experiments in this work are limited to approximately 130M-parameter models on LM1B and OpenWebText. Further research is needed to determine whether the same gains persist at substantially larger scales comparable to modern frontier models. Finally, diffusion language models remain more computationally expensive at inference time than autoregressive models due to iterative denoising. Further advances in caching, distillation, accelerated samplers, or consistency-based generation may be necessary to make diffusion-based language generation practical in latency-sensitive settings.

## References

- J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg (2021). Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems.
- T. Chen, R. Zhang, and G. Hinton (2023). Analog bits: generating discrete data using diffusion models with self-conditioning. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=3itjR9QxFw)
- Y. Chen, C. Liang, H. Sui, R. Guo, C. Cheng, J. You, and G. Liu (2026). LangFlow: continuous diffusion rivals discrete in language modeling. arXiv:2604.11748. [Link](https://arxiv.org/abs/2604.11748)
- S. Dieleman, L. Sartran, A. Roshannai, N. Savinov, Y. Ganin, P. H. Richemond, A. Doucet, R. Strudel, C. Dyer, C. Durkan, C. Hawthorne, R. Leblond, W. Grathwohl, and J. Adler (2022). Continuous diffusion for categorical data. arXiv:2211.15089. [Link](https://arxiv.org/abs/2211.15089)
- J. Ho, A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems.
- J. Jo and S. J. Hwang (2025). Continuous diffusion model for language modeling. arXiv:2502.11564. [Link](https://arxiv.org/abs/2502.11564)
- T. Karras, M. Aittala, T. Aila, and S. Laine (2022a). Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems.
- T. Karras, M. Aittala, T. Aila, and S. Laine (2022b). Elucidating the design space of diffusion-based generative models. arXiv. [DOI](https://dx.doi.org/10.48550/arxiv.2206.00364)
- C. Lee, J. Yoo, M. Agarwal, S. Shah, J. Huang, A. Raghunathan, S. Hong, N. M. Boffi, and J. Kim (2026). One-step language modeling via continuous denoising. arXiv:2602.16813. [Link](https://arxiv.org/abs/2602.16813)
- A. Lou, C. Meng, and S. Ermon (2023). Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv:2310.16834. [Link](https://arxiv.org/abs/2310.16834)
- A. Lou, C. Meng, and S. Ermon (2024). Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning, PMLR 235, pp. 32819–32848.
- P. Pynadath, J. Shi, and R. Zhang (2025). CANDI: hybrid discrete-continuous diffusion models. arXiv:2510.22510. [Link](https://arxiv.org/abs/2510.22510)
- P. Pynadath, J. Shi, and R. Zhang (2026). Generative frontiers: why evaluation matters for diffusion language models. arXiv:2604.02718. [Link](https://arxiv.org/abs/2604.02718)
- G. Raya, B. Nguyen, G. Batzolis, Y. Takida, D. Stancevic, N. Murata, C. Lai, Y. Mitsufuji, and L. Ambrogioni (2026). Information-guided noise allocation for efficient diffusion training. arXiv:2602.18647. [Link](https://arxiv.org/abs/2602.18647)
- D. Roos, O. Davis, F. Eijkelboom, M. Bronstein, M. Welling, İ. İ. Ceylan, L. Ambrogioni, and J. van de Meent (2026). Categorical flow maps. arXiv:2602.12233.
- S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov (2024). Simple and effective masked diffusion language models. arXiv:2406.07524. [Link](https://arxiv.org/abs/2406.07524)
- S. S. Sahoo et al. (2025). The diffusion duality. arXiv:2506.10892. [Link](https://arxiv.org/abs/2506.10892)
- Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2021). Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=PxTIG12RRHS)
- D. Stancevic, F. Handke, and L. Ambrogioni (2025). Entropic time schedulers for generative diffusion models. In Advances in Neural Information Processing Systems (NeurIPS 2025). [Link](https://openreview.net/forum?id=EfDIApcjgI)

## Appendix A Additional method details

### A.1 Matched-filter derivation

We derive the analytic matched filter used in [Equation 4](https://arxiv.org/html/2605.07013#S3.E4). Consider a single clean bit

$$x_0 \sim \mathrm{Bern}(1/2), \qquad x = x_0 + \sigma\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1).$$

The posterior log-odds are

$$\log\frac{p(x_0=1\mid x,\sigma)}{p(x_0=0\mid x,\sigma)} = \log\frac{p(x\mid x_0=1,\sigma)}{p(x\mid x_0=0,\sigma)} \tag{11}$$

$$= -\frac{(x-1)^2}{2\sigma^2} + \frac{x^2}{2\sigma^2} \tag{12}$$

$$= \frac{x-\frac{1}{2}}{\sigma^2}. \tag{13}$$

Therefore the optimal independent-bit denoiser is

$$D_{\mathrm{ind}}(x,\sigma) = \operatorname{sigmoid}\!\left(\frac{x-\frac{1}{2}}{\sigma^2}\right). \tag{14}$$

In a language model, bits are not independent: neighboring bits encode token identities and distant tokens constrain which codes are plausible. The matched-filter residual parameterization uses the closed-form independent posterior as a local baseline and asks the network to predict the contextual residual:

$$\ell_\theta(x_\sigma,\sigma) = r_\theta(x_\sigma,\sigma,x_{\mathrm{sc}}) + \operatorname{clip}\!\left(\frac{x_\sigma-\frac{1}{2}}{\sigma^2}, -C, C\right), \qquad D_\theta = \operatorname{sigmoid}(\ell_\theta). \tag{15}$$

The clipping constant $C=30$ is used for numerical stability. This decomposition prevents the model from spending capacity relearning the local Gaussian posterior of isolated bits and focuses the Transformer on contextual correction.

### A.2 Vocabulary-boundary scaling

Most diffusion language models retain a vocabulary-sized prediction interface. Simplex and one-hot continuous models diffuse in vocabulary-dimensional states; embedding-space continuous models reduce the state dimension but still commonly predict vocabulary logits; and discrete token-state models parameterize denoising distributions or transition scores over $V$ tokens. Thus, their output/loss boundary scales as $\mathcal{O}(TV)$.

Table 4: **Structural boundary scaling.** Here $m=\lceil\log_2 V\rceil$. Standard DLMs pay an $\mathcal{O}(V)$ output/loss boundary per token. A naive bit-level model removes this boundary but lengthens the Transformer sequence to $Tm$. Bitstream Diffusion patches the $m$ bits of each token into one trunk element, preserving semantic length $T$ while reducing the boundary to $\mathcal{O}(\log V)$ logits per token.
### A.3 Decoding

Generated analog bits are converted to binary codes by thresholding the final bit probabilities (equivalently, the final logits). The bits are grouped into fixed-width codes and decoded by the dataset-specific inverse mapping.

For LM1B, each $m=15$-bit group represents a raw `bert-base-uncased` tokenizer ID. For OWT, we support both the direct GPT-2-token interface and the fixed-length `gpt2id_bpe16` codec used in the main OWT runs. In the codec case, each $m=16$-bit group is first decoded to a `gpt2id_bpe16` code token, then inverted back to GPT-2 token IDs and decoded to text. GenPPL is always evaluated on the final decoded text; entropy is computed on recovered evaluation token IDs (`bert-base-uncased` for LM1B and GPT-2 for OWT), not on intermediate code-token IDs.
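A sketch of the OWT decoding path; `codec` stands in for the reversible `gpt2id_bpe16` codec and `gpt2_tokenizer` for a GPT-2 tokenizer, so both interfaces here are assumptions rather than a released API (`bits_to_tokens` is the repacking helper sketched in Section 3):

```python
def decode_owt(final_logits, codec, gpt2_tokenizer, m: int = 16):
    hard_bits = (final_logits > 0).float()        # threshold final logits at 0
    code_tokens = bits_to_tokens(hard_bits, m)    # (B, T) 16-bit code tokens
    gpt2_ids = codec.decode(code_tokens)          # hypothetical inverse codec map
    return [gpt2_tokenizer.decode(ids) for ids in gpt2_ids]
```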

## Appendix B Training and architecture details

### B.1 Entropy-rate schedule details

We use the online entropy-rate noise allocation strategy proposed by Raya et al. [[2026](https://arxiv.org/html/2605.07013#bib.bib17)], adapted to continuous diffusion over bitstreams. For VE corruption,

$$x_\sigma = x_0 + \sigma\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

the conditional entropy rate is

$$h_{\log}(\sigma) := \frac{d}{d\log\sigma}\,H(x_0 \mid x_\sigma).$$

By the I–MMSE relation and denoising-score-matching identities, this rate is related to a noise-rescaled denoising error. Thus, the losses already computed during training provide a practical online proxy for the information profile of the forward process.

During training, we maintain a FIFO buffer of pairs $(\sigma, e)$, where

$$e(\sigma) = \frac{1}{S}\sum_{i=1}^{S}\left(D_{\theta,i}(x_\sigma,\sigma) - x_{0,i}\right)^2$$

is the unweighted bitwise denoising error. Since our schedule is parameterized over log-noise, we use

$$\widehat{h}_{\log}(\sigma) = \frac{e(\sigma)}{\sigma^2 + \varepsilon} \tag{16}$$

as the online proxy for entropy production per unit $\log\sigma$. The small constant $\varepsilon$ is used only for numerical stability.

The interval $[\log\sigma_{\min}, \log\sigma_{\max}]$ is divided into log-spaced bins. Let $\bar{h}_k$ be the average value of $\widehat{h}_{\log}$ in bin $k$, and let $\sigma_k$ be the bin midpoint. We define the bin sampling probabilities by

$$q_k \propto g(\sigma_k; c, n)\,\bar{h}_k^\alpha, \qquad g(\sigma; c, n) = \frac{\sigma^n}{\sigma^n + c^n}. \tag{17}$$

The factor $g$ is a smooth low-noise gate that prevents the sampler from over-concentrating on nearly resolved bits. In the final runs, we use

$$\alpha = \frac{1}{2}, \qquad c = 0.1, \qquad n = 3.$$

Equivalently, this defines a normalized schedule density over $u = \log\sigma$,

$$\pi_\alpha(u) = \frac{g(e^u; c, n)\,\widehat{h}_{\log}(e^u)^\alpha}{\int_{\log\sigma_{\min}}^{\log\sigma_{\max}} g(e^v; c, n)\,\widehat{h}_{\log}(e^v)^\alpha\,dv}. \tag{18}$$

Training begins from the base EDM log-normal noise distribution. We warm up for 40K steps, linearly transition to the entropy-rate distribution over the next 10K steps, and then sample noise levels from the entropy-rate schedule. For sampling, we construct a sigma grid by approximately inverting the CDF of $\pi_\alpha$, following the entropic integration viewpoint of Stancevic et al. [[2025](https://arxiv.org/html/2605.07013#bib.bib18)].
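A sketch of the binned schedule and the resulting sampling grid, with `h_bar` standing in for the buffered bin-averaged proxy values (a placeholder, not the training-time buffer):

```python
import math
import torch

def schedule_probs(sigma_mid, h_bar, alpha=0.5, c=0.1, n=3):
    """Bin probabilities q_k of Eq. (17): low-noise gate times rate^alpha."""
    gate = sigma_mid**n / (sigma_mid**n + c**n)
    q = gate * h_bar**alpha
    return q / q.sum()

def sigma_grid_from_cdf(sigma_mid, probs, num_steps):
    """Decreasing sigma grid placed uniformly in the CDF of pi_alpha."""
    cdf = torch.cumsum(probs, dim=0)
    targets = torch.minimum(torch.linspace(0.0, 1.0, num_steps), cdf[-1])
    idx = torch.searchsorted(cdf, targets).clamp(max=len(sigma_mid) - 1)
    return sigma_mid[idx].flip(0)                 # from sigma_max down to sigma_min

# Illustrative usage with a flat placeholder profile over 256 log-spaced bins:
sigma_mid = torch.logspace(math.log10(0.002), math.log10(80.0), steps=256)
grid = sigma_grid_from_cdf(sigma_mid, schedule_probs(sigma_mid, torch.ones(256)), 256)
```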

### B.2 Architecture implementation details

The model is implemented as a sequence diffusion transformer specialized to continuous bitstreams. The input sequence has $S = Tm$ analog-bit positions, but the Transformer trunk operates at semantic length $T$ by patching the $m$ bits of each token or code token into one trunk element.

For continuous bits, the noisy bit input is centered by the data center $1/2$ and scaled by

$$c_{\mathrm{in}}(\sigma) = (\sigma^2 + \sigma_{\mathrm{data}}^2)^{-1/2}.$$

When self-conditioning is enabled, the self-conditioning prediction is centered and scaled in the same way and concatenated with the noisy-bit embedding. We use local Fourier features within each bit patch, RoPE in the Transformer trunk, and no absolute global Fourier features in the final configuration.
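In code, this preconditioning amounts to a single centering-and-scaling operation; a sketch, assuming `sigma` is a tensor broadcastable against the bit tensor:

```python
import torch

def precondition_bits(x_noisy: torch.Tensor, sigma: torch.Tensor,
                      sigma_data: float = 0.5, center: float = 0.5):
    """Center noisy analog bits at the data mean and apply the EDM-style
    input scaling c_in(sigma) = (sigma^2 + sigma_data^2)^(-1/2)."""
    c_in = (sigma ** 2 + sigma_data ** 2).rsqrt()
    return c_in * (x_noisy - center)
```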

The trunk uses 12 Transformer blocks with hidden width 768, 12 attention heads, feed-forward width 3072, AdaLN-zero time conditioning, SwiGLU feed-forward layers, dropout $0.1$, and SDPA/FlashAttention kernels when available. The time embedding is a sinusoidal embedding of $\log\sigma$, projected into the model width and used by the AdaLN-zero modules.

The output head is an optimal skip MLP head. For each trunk token, a patch adapter expands the global token representation into $m$ per-bit hidden states. In parallel, a local adapter embeds the current noisy bit and self-conditioning features. The global and local per-bit features are added, modulated by a time-conditioned AdaLN-zero block, and mapped to one residual logit per bit. The final logit adds the analytic matched-filter term from [Equation 4](https://arxiv.org/html/2605.07013#S3.E4).

### B.3 Asymmetric time intervals

We support the asymmetric time-interval label shift introduced in Analog Bits [Chen et al., [2023](https://arxiv.org/html/2605.07013#bib.bib6)]. When evaluating the denoiser at state noise $\sigma_{\mathrm{state}}$, we optionally shift the sigma label toward a noisier adjacent level in log-space:

$$\sigma_{\mathrm{eval}} = \exp\bigl((1-\eta)\log\sigma_{\mathrm{state}} + \eta\log\sigma_{\mathrm{noisier}}\bigr). \qquad (19)$$

Here $\eta = 0$ recovers the standard sampler. Positive $\eta$ can improve some low-NFE deterministic settings, especially when combined with self-conditioning, but its effect is weaker and less consistent once stochastic churn is enabled. Therefore all main 256-NFE stochastic results use $\eta = 0$.
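The shift is a geometric interpolation of the two noise levels; a one-function sketch of Equation (19):

```python
import math

def shifted_sigma_label(sigma_state: float, sigma_noisier: float,
                        eta: float) -> float:
    """Eq. (19): interpolate the sigma label in log-space toward the
    noisier adjacent level. eta = 0 recovers the standard sampler."""
    return math.exp((1.0 - eta) * math.log(sigma_state)
                    + eta * math.log(sigma_noisier))
```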

### B.4 Final LM1B configuration

- Dataset: LM1B, `bert-base-uncased` tokenizer, packed 128-token blocks.
- Representation: raw binary tokenizer IDs, 15 bits/token, sequence length 1920 bits.
- Model: SDT, 12 blocks, width 768, 12 heads, FF width 3072, patch size 15.
- Head: optimal skip MLP, hidden size 128.
- Diffusion: $\sigma_{\min} = 0.002$, $\sigma_{\max} = 80$, $\rho = 7$, $\sigma_{\mathrm{data}} = 0.5$, data center $0.5$.
- Training: binary score matching, EDM weighting, global batch size 512, AdamW, learning rate $3\times 10^{-4}$, cosine decay, 2500 warmup steps, 1M optimizer steps.
- Entropy-rate schedule: online sqrt-rate allocation, $\alpha = 1/2$, low-noise gate $c = 0.1$, $n = 3$, 40K warmup and 10K transition.
- Self-conditioning: enabled, training probability $0.5$, carry mode at sampling.
- Logit parameterization: matched-filter residual, center $0.5$, scale $1.0$, clip $30$.

### B.5 Final OpenWebText configuration

- Dataset: OpenWebText with GPT-2 tokenizer, train split `train[:-100000]`, validation split `train[-100000:]`.
- Codec: fixed-length `gpt2id_bpe16` second-stage code tokenizer, 1024 code tokens.
- Representation: raw binary code-token IDs, 16 bits/token, sequence length 16384 bits.
- Model: same SDT trunk as LM1B, patch size 16.
- Diffusion, optimizer, self-conditioning, matched-filter residual, and entropy-rate schedule: same as LM1B.
- Training: global batch size 512, AdamW, learning rate $3\times 10^{-4}$, cosine decay, 2500 warmup steps.
- Main reported checkpoint: 750K optimizer steps.

## Appendix C Discrete bitstream diffusion baseline

A natural diagnostic baseline is to keep the fixed-width bitstream representation, but replace the continuous Gaussian corruption process with a discrete diffusion process over bits. This preserves the favourable $\mathcal{O}(\log V)$ output boundary, since each token is represented by $m = \lceil\log_2 V\rceil$ binary variables, but removes the continuous analog state space used by Bitstream Diffusion.

We evaluate this option using a SEDD-style absorbing discrete diffusion baseline [Lou et al., [2024](https://arxiv.org/html/2605.07013#bib.bib20)] on LM1B bitstreams. For comparison, we also train a matched token-level SEDD baseline on the same dataset. Both baselines use `bert-base-uncased`, sequence length $T = 128$, a 12-layer SDT trunk with width 768, 12 attention heads, feed-forward width 3072, RoPE, AdaLN time conditioning, SwiGLU activations, dropout $0.1$, BF16 training, AdamW with learning rate $3\times 10^{-4}$, and a log-linear absorbing discrete diffusion schedule. The token baseline operates over the BERT vocabulary plus an absorbing mask state, while the bitstream baseline operates directly over the corresponding fixed-width bit representation. Both models are evaluated using the same Tweedie sampler sweep, 1024 generated samples, and the same external `gpt2-large` GenPPL evaluator.

Table 5: Discrete diffusion on bits versus tokens on LM1B. Both baselines use a SEDD-style absorbing discrete diffusion process and the same 12-layer SDT trunk. The bitstream baseline preserves the favourable $\mathcal{O}(\log V)$ output boundary, but substantially underperforms the token-level discrete baseline across all NFE budgets. At high NFE, the token-frequency entropies are comparable, indicating that the gap is not simply explained by entropy collapse.

The bitstream discrete baseline is substantially worse than the token-level discrete baseline across the full sampling range. At 256 NFEs, discrete diffusion on bits obtains GenPPL $285.09$, compared with $126.28$ for token-level discrete diffusion. At 512 NFEs, the gap remains similarly large: $272.18$ versus $118.40$. The token-frequency entropies are close in the high-NFE regime, so the difference is primarily reflected in external sample quality rather than in a simple entropy-collapse artifact.

This diagnostic suggests that the compact bit boundary alone is not sufficient to make bit\-level discrete diffusion competitive in this setting\. This motivates our choice to use continuous Gaussian diffusion over bitstreams, rather than a discrete bit\-level absorbing process\.

## Appendix D Stochastic churn as an information-adaptive reverse SDE

We give a continuous-time interpretation of the stochastic sampler used in Bitstream Diffusion. The purpose is to explain why full-band EDM-style churn becomes information-adaptive when combined with the entropy-rate sampling grid defined by the normalized schedule density $\pi_\alpha$ (detailed in [Section B.1](https://arxiv.org/html/2605.07013#A2.SS1)). We focus on the clean limiting structure of the method: full-band stochasticity, $S_{\mathrm{noise}} = 1$, no asymmetric-time-interval label shift, and the small-step regime where the EDM churn parameter is not clipped. These assumptions remove finite-step implementation details while preserving the mechanism relevant to the sampler used in practice.

### D.1 Discrete implementation of EDM churn

For completeness, we first detail the discrete EDM-style churn step used in practice [Karras et al., [2022b](https://arxiv.org/html/2605.07013#bib.bib19)]. Before a deterministic step from $\sigma_i$ to $\sigma_{i+1}$, the sampler optionally increases the current noise level to

$$\hat{\sigma}_i = (1 + \gamma_i)\,\sigma_i, \qquad (20)$$

$$\hat{x}_i = x_i + S_{\mathrm{noise}}\sqrt{\hat{\sigma}_i^2 - \sigma_i^2}\; z_i, \qquad z_i \sim \mathcal{N}(0, I), \qquad (21)$$

and then applies the deterministic update from $(\hat{x}_i, \hat{\sigma}_i)$ to $\sigma_{i+1}$. In our experiments, we use $S_{\mathrm{noise}} = 1.003$. The raw churn amount is

$$\gamma_i = \begin{cases} \min(S_{\mathrm{churn}}/N,\; \sqrt{2} - 1), & \sigma_i \in \mathcal{W}, \\ 0, & \text{otherwise}, \end{cases} \qquad (22)$$

where $N$ is the number of sampling intervals and $\mathcal{W}$ is the stochasticity window.
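Putting Equations (20)–(22) together, one stochastic step has the following shape (a sketch; `score_fn` denotes the Tweedie score estimate defined in Section D.2):

```python
import torch

def edm_churn_step(x, sigma_i, sigma_next, score_fn,
                   s_churn, n_steps, in_window=True, s_noise=1.003):
    """One EDM-style stochastic step: noise increase (Eqs. 20-21),
    then a deterministic Euler update to sigma_next (cf. Eq. 33)."""
    gamma = min(s_churn / n_steps, 2 ** 0.5 - 1) if in_window else 0.0  # Eq. (22)
    sigma_hat = (1.0 + gamma) * sigma_i                                 # Eq. (20)
    if gamma > 0:
        extra = (sigma_hat ** 2 - sigma_i ** 2) ** 0.5
        x = x + s_noise * extra * torch.randn_like(x)                   # Eq. (21)
    # Deterministic probability-flow Euler step from (x, sigma_hat).
    return x + (sigma_hat - sigma_next) * sigma_hat * score_fn(x, sigma_hat)
```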

### D.2 VE corruption and probability-flow sampling

Consider the VE corruption process

$$x_\sigma = x_0 + \sigma\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad (23)$$

where $x_0 \in \{0, 1\}^S$ for binary bitstreams. Let $p_\sigma$ denote the density of $x_\sigma$. The denoiser predicts

$$D_\theta(x, \sigma) \approx \mathbb{E}[x_0 \mid x_\sigma = x]. \qquad (24)$$

By Tweedie’s formula, this induces the score estimate

$$s_\theta(x, \sigma) = \frac{D_\theta(x, \sigma) - x}{\sigma^2}. \qquad (25)$$

The deterministic probability-flow sampler follows

$$dx = -\sigma\, s_\theta(x, \sigma)\, d\sigma. \qquad (26)$$

For a decreasing schedule

$$\sigma_0 > \sigma_1 > \cdots > \sigma_N,$$

define

$$\Delta_i = \sigma_i - \sigma_{i+1} > 0.$$

An Euler step of [Equation 26](https://arxiv.org/html/2605.07013#A4.E26) gives

$$x_{i+1} = x_i + \Delta_i\,\sigma_i\, s_\theta(x_i, \sigma_i). \qquad (27)$$
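A compact sketch of Equations (25)–(27), assuming a denoiser callable and a descending sigma grid; the bitstream data center $1/2$ is included in the initialization.

```python
import torch

def tweedie_score(denoiser, x, sigma):
    """Eq. (25): score estimate induced by the posterior-mean denoiser."""
    return (denoiser(x, sigma) - x) / sigma ** 2

@torch.no_grad()
def probability_flow_sample(denoiser, sigmas, shape):
    """Euler integration of the probability-flow ODE, Eq. (27)."""
    x = 0.5 + sigmas[0] * torch.randn(shape)   # start near the data center
    for sigma_i, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        s = tweedie_score(denoiser, x, sigma_i)
        x = x + (sigma_i - sigma_next) * sigma_i * s
    return x
```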

### D.3 A generalized reverse VE SDE

For any nonnegative function $\lambda(\sigma)$, consider the reverse-time SDE parameterized by $dr = -d\sigma > 0$:

$$dx = (1 + \lambda(\sigma))\,\sigma\,\nabla_x \log p_\sigma(x)\, dr + \sqrt{2\lambda(\sigma)\sigma}\; dW_r. \qquad (28)$$

With the exact score, all choices of $\lambda(\sigma)$ preserve the same one-time marginals $p_\sigma$. The case $\lambda = 0$ recovers the probability-flow ODE, while positive $\lambda$ adds a Langevin correction. With a learned score, we interpret

$$dx = (1 + \lambda(\sigma))\,\sigma\, s_\theta(x, \sigma)\, dr + \sqrt{2\lambda(\sigma)\sigma}\; dW_r \qquad (29)$$

as an SDE interpretation of the numerical sampler.

Discretizing [Equation 29](https://arxiv.org/html/2605.07013#A4.E29) from $\sigma_i$ to $\sigma_{i+1}$ gives

$$x_{i+1} = x_i + \Delta_i\,\sigma_i\, s_\theta(x_i, \sigma_i) + \lambda_i\,\Delta_i\,\sigma_i\, s_\theta(x_i, \sigma_i) + \sqrt{2\lambda_i\,\sigma_i\,\Delta_i}\; z_i, \qquad z_i \sim \mathcal{N}(0, I). \qquad (30)$$

The first drift term is the probability-flow step. The remaining drift and noise terms form a Langevin correction with effective SDE strength $\lambda_i$.
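One Euler–Maruyama step of Equation (30), written to make the three terms explicit (a sketch):

```python
import torch

def generalized_sde_step(x, sigma_i, sigma_next, score_fn, lam):
    """Eq. (30): probability-flow drift plus a Langevin correction of
    effective strength lam = lambda_i."""
    delta = sigma_i - sigma_next
    s = score_fn(x, sigma_i)
    pf_drift = delta * sigma_i * s                 # probability-flow term
    langevin_drift = lam * delta * sigma_i * s     # extra Langevin drift
    noise = (2.0 * lam * sigma_i * delta) ** 0.5 * torch.randn_like(x)
    return x + pf_drift + langevin_drift + noise
```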

### D.4 EDM churn as a local Langevin correction

As defined in [Equation 21](https://arxiv.org/html/2605.07013#A4.E21), the EDM churn step first increases the noise level to

$$\hat{\sigma}_i = (1 + \gamma_i)\,\sigma_i \qquad (31)$$

and perturbs the state by the corresponding additional Gaussian noise:

$$\hat{x}_i = x_i + \sqrt{\hat{\sigma}_i^2 - \sigma_i^2}\; z_i, \qquad z_i \sim \mathcal{N}(0, I). \qquad (32)$$

Here we set $S_{\mathrm{noise}} = 1$ for the continuous-time limit. It then takes a deterministic probability-flow step from $(\hat{x}_i, \hat{\sigma}_i)$ to $\sigma_{i+1}$:

$$x_{i+1} = \hat{x}_i + (\hat{\sigma}_i - \sigma_{i+1})\,\hat{\sigma}_i\, s_\theta(\hat{x}_i, \hat{\sigma}_i). \qquad (33)$$
Assume the small\-step regime

$$\gamma_i = \mathcal{O}(N^{-1}), \qquad \Delta_i = \mathcal{O}(N^{-1}),$$

with full-band unclipped churn ($\gamma_i = S_{\mathrm{churn}}/N$), and local smoothness of $s_\theta$. Then

$$\hat{\sigma}_i^2 - \sigma_i^2 = 2\gamma_i\,\sigma_i^2 + \gamma_i^2\,\sigma_i^2, \qquad (34)$$

so

$$\hat{x}_i = x_i + \sqrt{2\gamma_i\,\sigma_i^2}\; z_i + o(N^{-1/2}). \qquad (35)$$

Moreover,

$$(\hat{\sigma}_i - \sigma_{i+1})\,\hat{\sigma}_i = (\Delta_i + \gamma_i\,\sigma_i)\,\sigma_i\,(1 + \gamma_i) \qquad (36)$$

$$= \Delta_i\,\sigma_i + \gamma_i\,\sigma_i^2 + \mathcal{O}(N^{-2}). \qquad (37)$$

Replacing $s_\theta(\hat{x}_i, \hat{\sigma}_i)$ by $s_\theta(x_i, \sigma_i)$ in the $\mathcal{O}(N^{-1})$ deterministic coefficient changes the update only by higher-order terms. Therefore

$$x_{i+1} = x_i + \Delta_i\,\sigma_i\, s_\theta(x_i, \sigma_i) + \gamma_i\,\sigma_i^2\, s_\theta(x_i, \sigma_i) + \sqrt{2\gamma_i\,\sigma_i^2}\; z_i + \mathrm{h.o.t.} \qquad (38)$$

The first term is exactly the probability-flow Euler step in [Equation 27](https://arxiv.org/html/2605.07013#A4.E27). The remaining two terms are a local Langevin correction at noise level $\sigma_i$.

Comparing [Equation 38](https://arxiv.org/html/2605.07013#A4.E38) with the generalized SDE discretization in [Equation 30](https://arxiv.org/html/2605.07013#A4.E30), both the additional drift and injected variance match to leading order when

$$\boxed{\lambda_i \approx \frac{\gamma_i\,\sigma_i}{\Delta_i} = \frac{\gamma_i\,\sigma_i}{\sigma_i - \sigma_{i+1}}}. \qquad (39)$$

This is the central identity: the raw churn parameter $\gamma_i$ is not itself the effective continuous-time stochasticity. The local schedule spacing $\Delta_i$ also determines where the sampler applies strong Langevin correction.

### D.5 Finite-step caveat

The previous matching is asymptotic. For finite $\gamma_i$, the churn split step is not exactly equal to one Euler step of [Equation 29](https://arxiv.org/html/2605.07013#A4.E29) with a single scalar $\lambda_i$, because drift matching and variance matching give slightly different values.

Keeping the finite-$\gamma_i$ deterministic coefficient while ignoring the higher-order score-evaluation shift, the extra deterministic coefficient beyond the probability-flow step is

$$(\Delta_i + \gamma_i\,\sigma_i)\,\sigma_i\,(1 + \gamma_i) - \Delta_i\,\sigma_i \qquad (40)$$

$$= \gamma_i\,\sigma_i^2 + \gamma_i\,\Delta_i\,\sigma_i + \gamma_i^2\,\sigma_i^2. \qquad (41)$$

Matching this to $\lambda_i\,\Delta_i\,\sigma_i\, s_\theta$ gives

$$\lambda_i^{\mathrm{drift}} = \frac{\gamma_i\,\sigma_i}{\Delta_i} + \gamma_i + \frac{\gamma_i^2\,\sigma_i}{\Delta_i}. \qquad (42)$$

On the other hand, exact variance matching gives

$$\lambda_i^{\mathrm{noise}} = \left(\gamma_i + \frac{\gamma_i^2}{2}\right)\frac{\sigma_i}{\Delta_i}. \qquad (43)$$

These are not equal for finite $\gamma_i$. However, in the small-step regime $\gamma_i = \mathcal{O}(N^{-1})$, $\Delta_i = \mathcal{O}(N^{-1})$, both reduce to

$$\lambda_i = \frac{\gamma_i\,\sigma_i}{\Delta_i} + \mathcal{O}(N^{-1}).$$

Thus [Equation 39](https://arxiv.org/html/2605.07013#A4.E39) is the leading-order continuous-time interpretation of stochastic churn, not an exact finite-step identity.
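These limits are easy to verify numerically; a small check against Equations (39), (42), and (43):

```python
# For small gamma and delta, both the drift-matched and variance-matched
# lambdas (Eqs. 42-43) collapse to gamma * sigma / delta (Eq. 39).
sigma = 1.0
for gamma, delta in [(1e-1, 1e-1), (1e-2, 1e-2), (1e-3, 1e-3)]:
    lam_leading = gamma * sigma / delta
    lam_drift = gamma * sigma / delta + gamma + gamma**2 * sigma / delta  # Eq. (42)
    lam_noise = (gamma + gamma**2 / 2) * sigma / delta                    # Eq. (43)
    print(gamma, lam_drift / lam_leading, lam_noise / lam_leading)
# ratios are 1 + delta/sigma + gamma and 1 + gamma/2, both -> 1
```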

### D.6 Entropy-rate grids make full-band churn information-adaptive

For full\-band unclipped EDM churn, the raw churn budget is distributed uniformly across sampling intervals:

$$\gamma_i = \frac{S_{\mathrm{churn}}}{N}. \qquad (44)$$

Substituting into [Equation 39](https://arxiv.org/html/2605.07013#A4.E39) gives

$$\lambda_i \approx \frac{S_{\mathrm{churn}}}{N}\,\frac{\sigma_i}{\sigma_i - \sigma_{i+1}}. \qquad (45)$$

Therefore the same raw churn budget induces different effective reverse SDEs under different sampling schedules.

Let $u = \log\sigma$, and let $\pi_\alpha(u)$ denote the normalized schedule density used to construct the entropy-rate sampling grid, as defined in [Equation 18](https://arxiv.org/html/2605.07013#A2.E18). The grid places sampling points approximately uniformly in the CDF of $\pi_\alpha$. Thus consecutive log-noise points satisfy

$$F_\alpha(u_i) - F_\alpha(u_{i+1}) \approx \frac{1}{N}.$$

For interior points where $\pi_\alpha$ is smooth and positive,

$$u_i - u_{i+1} \approx \frac{1}{N\,\pi_\alpha(u_i)}. \qquad (46)$$

Since

$$\sigma_i - \sigma_{i+1} \approx \sigma_i\,(u_i - u_{i+1}),$$

we obtain

$$\sigma_i - \sigma_{i+1} \approx \frac{\sigma_i}{N\,\pi_\alpha(\log\sigma_i)}. \qquad (47)$$

Substituting into [Equation 45](https://arxiv.org/html/2605.07013#A4.E45) yields

$$\boxed{\lambda_{\mathrm{ent}}(\sigma) \approx S_{\mathrm{churn}}\,\pi_\alpha(\log\sigma)}. \qquad (48)$$
Thus, under the entropy-rate grid, constant full-band raw churn induces an information-adaptive reverse SDE. In the configuration used for the main results, $\alpha = 1/2$, so the effective Langevin strength is proportional to the normalized schedule density induced by the square root of the entropy-rate proxy over log-noise. This makes the stochastic correction largest in the same noise region where denoising information is concentrated.
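Given any descending sigma grid, the induced per-step Langevin strength of Equation (45) can be computed directly; a sketch:

```python
import numpy as np

def effective_lambda(sigmas, s_churn):
    """Eq. (45): effective Langevin strength induced by full-band churn
    (gamma_i = S_churn / N) on a descending sigma grid."""
    n = len(sigmas) - 1
    sig_i, sig_next = np.asarray(sigmas[:-1]), np.asarray(sigmas[1:])
    return (s_churn / n) * sig_i / (sig_i - sig_next)

# On an entropy-rate grid, sigma_i - sigma_{i+1} ~ sigma_i / (N pi_alpha),
# so this profile tracks S_churn * pi_alpha(log sigma), as in Eq. (48).
```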

For comparison, a Karras grid induces a smooth analytic stochasticity profile determined by the grid geometry rather than by the measured bitstream entropy rate. This explains why the same raw stochastic sampler can behave very differently under Karras and entropy-rate schedules, as observed in [Section E.1](https://arxiv.org/html/2605.07013#A5.SS1).

### D.7 Interpretation for bitstreams

This effect is especially important for continuous diffusion over discrete bitstreams. The noisy marginal $p_\sigma$ is a mixture of Gaussians centered at hypercube vertices. At high noise, the mixture components overlap strongly and bit identities are ambiguous. At low noise, the components are separated and most bits are already determined. Between these regimes, there is a comparatively narrow transition region where many posterior bit probabilities change rapidly.

Finite\-step probability\-flow integration can be fragile in this transition region: discretization or score error may push coordinates toward incorrect hypercube vertices, after which the matched\-filter term and contextual denoiser can reinforce premature commitments\. The stochastic correction mitigates this failure mode by locally re\-randomizing ambiguous coordinates and allowing the denoiser to refreeze them under updated global context\.

The entropy-rate schedule aligns two effects. It places more deterministic solver points where denoising information is concentrated, and through [Equation 48](https://arxiv.org/html/2605.07013#A4.E48) it also increases the effective Langevin correction in those same regions. Hence full-band churn on the entropy-rate grid acts as an information-adaptive predictor-corrector sampler: both the deterministic discretization budget and the stochastic correction budget are concentrated where the bitstream is actively resolving information.

## Appendix E Additional sampling ablations

We provide additional sampling diagnostics supporting the sampler choices used in the main paper\. The main comparison already isolates the deterministic probability\-flow sampler from the stochastic sampler\. Here we focus on the internal stochastic\-sampler choices: the sigma grid, the churn window, the churn budget, and the asymmetric time\-lag parameter\.

### E.1 Entropic versus Karras sampling grids

[Figure 4](https://arxiv.org/html/2605.07013#A5.F4) compares the entropy-rate sampling grid against the standard Karras grid on LM1B at 64, 128, and 256 NFEs. For each NFE, we show both deterministic sampling and a sweep over stochastic churn budgets, using the same broad entropy-CDF churn window $q \in [0.1, 0.9]$. This experiment isolates the effect of the sampling grid while keeping the trained checkpoint fixed.

There are two conclusions\. First, the entropy\-rate grid improves deterministic sampling itself\. The dashed entropy\-rate curves are consistently below the corresponding Karras dashed curves, especially at low NFE\. This indicates that the entropy\-rate grid is a better discretization of the probability\-flow trajectory for bitstreams: it allocates more solver resolution to the noise region where denoising error and bit uncertainty are concentrated\.

Second, the grid choice becomes even more important under stochastic churn. On the entropy-rate grid, moderate churn substantially improves GenPPL over the deterministic sampler, particularly at 128 and 256 NFEs. At 256 NFEs, the best entropy-rate stochastic configurations reach the GenPPL-60 regime, while the Karras stochastic configurations remain above the Karras deterministic baseline. A similar separation appears at 128 NFEs, where the entropy-rate stochastic sampler reaches the GenPPL-80 regime whereas Karras remains far higher. Thus stochastic churn is not automatically beneficial; it is beneficial when the grid places the stochastic correction in the right part of the trajectory.

This supports the sampler used in the main experiments: entropy-rate sampling is useful as a deterministic solver grid, and it also makes stochastic churn more effective. The reverse-SDE analysis in [Appendix D](https://arxiv.org/html/2605.07013#A4) gives a continuous-time explanation: the effective Langevin strength scales with the inverse local sigma spacing, so the entropy-rate grid concentrates the stochastic correction near the information-active noise region.

![Refer to caption](https://arxiv.org/html/2605.07013v1/x4.png)

Figure 4: LM1B schedule ablation. We compare the entropy-rate grid and the Karras grid at 64, 128, and 256 NFEs. Dashed horizontal lines show deterministic sampling; solid curves show stochastic churn sweeps. The entropy-rate grid improves deterministic sampling and, more importantly, enables stochastic churn to reach a much better GenPPL regime than the Karras grid across all NFE budgets.
### E.2 Stochasticity-window ablation

[Figure 5](https://arxiv.org/html/2605.07013#A5.F5) studies where stochastic churn should be applied. We fix LM1B, NFE $= 256$, the entropy-rate grid, and $\eta = 0$, and sweep both the churn budget and the entropy-CDF window

$$\mathcal{W} = [Q(q_{\mathrm{lo}}), Q(q_{\mathrm{hi}})].$$
The left panel varies the location of a fixed\-width 25% entropy\-CDF window\. No narrow location is uniformly best\. Low\-tail, centered, upper\-mid, and high\-tail windows all improve some configurations, but each leaves a clear gap to the best broad\-window results\. This suggests that the useful stochastic correction is not confined to one sharply localized part of the trajectory\.

The right panel varies the window width while keeping it centered. Performance improves as the window becomes broader. The full entropy-supported range $q \in [0, 1]$ gives the best GenPPL in this sweep, with the broad $q \in [0.1, 0.9]$ setting close behind. Narrow centered windows are substantially weaker. We therefore use full-band churn in the main comparison.

This result is also consistent with the continuous-time view in [Appendix D](https://arxiv.org/html/2605.07013#A4). Full-band raw churn does not imply uniform effective stochasticity: on the entropy-rate grid, the local grid spacing already concentrates the effective Langevin correction near the information-active region. Empirically, applying churn broadly is more robust than trying to hand-target a narrow window.

![Refer to caption](https://arxiv.org/html/2605.07013v1/x5.png)

Figure 5: LM1B stochasticity-window ablation at 256 NFEs. Left: location sweep with a fixed 25% entropy-CDF width. Right: centered width sweep. Narrow windows are sensitive to their location, while broad windows are consistently stronger. The full entropy-supported range $q \in [0, 1]$ gives the best GenPPL in this sweep.
### E.3 Additional OWT churn sweeps

The main text shows the OWT churn sweep at $\eta = 0$. In [Figure 6](https://arxiv.org/html/2605.07013#A5.F6), we repeat the sweep at $\eta = 0.6$. The same qualitative behavior holds: small-to-moderate churn gives the best GenPPL–entropy trade-off, while excessive churn raises token-frequency entropy at the cost of worse GenPPL. This confirms that the churn budget is the dominant sampler-level control knob for moving along the quality–diversity frontier.

The precise best operating point changes with $\eta$, but the overall frontier shape is stable: stochasticity first corrects the over-contraction of the deterministic sampler, and then eventually becomes too strong. This is why the main results tune $S_{\mathrm{churn}}$ but keep $\eta = 0$ for the default 256-NFE stochastic comparison.

![Refer to caption](https://arxiv.org/html/2605.07013v1/x6.png)

Figure 6: OWT churn sweep at fixed $\eta = 0.6$. The qualitative behavior matches the $\eta = 0$ sweep: moderate churn gives the best GenPPL–entropy trade-off, while excessive churn increases entropy at the cost of GenPPL.
### E.4 Asymmetric time-lag ablation

We also ablate the asymmetric time-lag parameter $\eta$, which shifts the denoiser evaluation label toward a noisier adjacent level in log-$\sigma$ space. [Figure 7](https://arxiv.org/html/2605.07013#A5.F7) reports changes in GenPPL and entropy relative to $\eta = 0$, both for deterministic sampling and for the smallest nonzero stochastic churn configuration.

Positive $\eta$ can improve GenPPL in some low-NFE deterministic regimes, but the effect is not uniformly Pareto-improving. Once stochastic churn is enabled, the effect of $\eta$ becomes weaker and less consistent. Entropy changes are also mixed across NFE. We therefore treat $\eta$ as a secondary tuning parameter: it can be useful for low-NFE sweeps, but we use $\eta = 0$ for the main high-NFE stochastic comparisons.

![Refer to caption](https://arxiv.org/html/2605.07013v1/x7.png)

![Refer to caption](https://arxiv.org/html/2605.07013v1/x8.png)

Figure 7: Effect of asymmetric time-lag $\eta$ on OWT. We show changes relative to $\eta = 0$ for deterministic sampling and for the smallest nonzero stochastic churn configuration. Top: change in GenPPL. Bottom: change in token-frequency entropy. Positive $\eta$ can help at low NFE in deterministic sampling, but its effect is weaker and less consistent once stochastic churn is enabled.

## Appendix F Computational and Memory Efficiency

This appendix provides the scaling sweeps and boundary-only analysis supporting the main-text systems summary in [Section 4.3](https://arxiv.org/html/2605.07013#S4.SS3).

### F.1 Profiling protocol

All profiling experiments are executed on a single NVIDIA GH200 node using synthetic data to isolate architectural differences\. Synthetic batches remove dataloader throughput, tokenization, disk I/O, decoding, external scoring, cache writes, and host\-device transfer effects\. The measured differences therefore isolate the representation, output head, loss boundary, and denoising\-loop cost\.

We disable `torch.compile` to avoid conflating representation-level effects with graph specialization, compilation warmup, or cache effects. All models use BF16 autocast, and memory is reported as CUDA peak allocated memory. Throughput is reported in semantic tokens per second; for Bitstream Diffusion, $m$ generated bit logits correspond to one semantic token or code token. Training measurements use AdamW, 10 warmup steps, and 50 timed forward–backward–optimizer steps. Self-conditioning is disabled, so each training step corresponds to one denoiser call. Inference measurements use no gradients, 50 representative untimed denoiser calls for warmup, and five timed 128-NFE synthetic generation trajectories. The 128-NFE setting is used for profiling efficiency: for a fixed sampler implementation, generation cost is approximately linear in the number of denoiser evaluations, so this benchmark captures the relative per-evaluation systems cost.

For each dataset, the token-space baseline and Bitstream Diffusion use the same semantic sequence length, 12-layer SDT trunk, hidden width $d = 768$, number of heads, and mixed-precision settings. The token baseline uses a continuous one-hot state, a vocabulary-wide output head, and cross-entropy. The bitstream model uses patched analog bits, the `optimal_skip_mlp` bit head, and a bitwise loss. LM1B uses $T = 128$, $V = 30{,}522$, $m = 15$; OWT uses $T = 1024$, $V = 65{,}536$, $m = 16$.
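For concreteness, the training-side measurement corresponds to a loop of the following shape (a sketch under our reading of the protocol; function and variable names are ours):

```python
import time
import torch

def profile_train_steps(model, opt, make_batch, warmup=10, steps=50):
    """Timed forward-backward-optimizer loop under BF16 autocast,
    reporting steps/s and CUDA peak allocated memory."""
    torch.cuda.reset_peak_memory_stats()
    t0 = None
    for i in range(warmup + steps):
        if i == warmup:                      # start timing after warmup
            torch.cuda.synchronize()
            t0 = time.perf_counter()
        x, target = make_batch()             # synthetic batch, no dataloader
        with torch.autocast("cuda", dtype=torch.bfloat16):
            loss = model(x, target)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    torch.cuda.synchronize()
    steps_per_s = steps / (time.perf_counter() - t0)
    peak_gib = torch.cuda.max_memory_allocated() / 2 ** 30
    return steps_per_s, peak_gib
```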

### F.2 End-to-end training sweeps

[Tables 6](https://arxiv.org/html/2605.07013#A6.T6) and [7](https://arxiv.org/html/2605.07013#A6.T7) show the training sweeps for batch sizes $B \geq 16$. Smaller batches are dominated by fixed overheads and under-utilize the GPU, making their throughput ratios less representative of the high-throughput regime used in DLM training. As batch size increases, the dense $BTV$ token boundary increasingly penalizes memory and throughput. The OOM rows also illustrate a practical scaling issue: the token-space model reaches the per-GPU memory limit earlier, which in large-batch DLM training requires either smaller per-device batches or more devices to distribute the same global workload.

Table 6: Training-step efficiency on LM1B ($T = 128$, $V = 30{,}522$). The token baseline and Bitstream Diffusion share an identical SDT trunk and differ only in the representation and output/loss boundary.

| Batch size | Peak VRAM, Tokens (GiB) | Peak VRAM, Bitstream (GiB) | Mem. ↓ | Semantic tok/s, Tokens | Semantic tok/s, Bitstream | Speed ↑ |
|---|---|---|---|---|---|---|
| 16 | 3.11 | 1.90 | 1.63× | 45,787 | 44,031 | 0.96× |
| 32 | 4.70 | 3.00 | 1.57× | 90,323 | 88,021 | 0.97× |
| 64 | 8.11 | 5.20 | 1.56× | 156,924 | 174,845 | 1.11× |
| 128 | 15.29 | 9.59 | 1.59× | 158,496 | 260,400 | 1.64× |
| 256 | 29.87 | 18.38 | 1.62× | 148,460 | 278,160 | 1.87× |
| 512 | 59.04 | 35.97 | 1.64× | 126,026 | 290,758 | 2.31× |

The token baseline triggers Out-Of-Memory (OOM) at the next profiled batch size.

Table 7: Training-step efficiency on OWT ($T = 1024$, $V = 65{,}536$). The dense $BTV$ logit tensor pushes the token baseline close to the single-GPU memory limit by batch size 64, whereas the bitstream model remains substantially lighter and faster.

| Batch size | Peak VRAM, Tokens (GiB) | Peak VRAM, Bitstream (GiB) | Mem. ↓ | Semantic tok/s, Tokens | Semantic tok/s, Bitstream | Speed ↑ |
|---|---|---|---|---|---|---|
| 16 | 22.65 | 9.64 | 2.35× | 112,542 | 230,225 | 2.05× |
| 32 | 44.70 | 18.46 | 2.42× | 98,642 | 245,664 | 2.49× |
| 64 | 88.79 | 36.12 | 2.46× | 77,710 | 255,642 | 3.29× |

The token baseline triggers Out-Of-Memory (OOM) at the next profiled batch size.
### F.3 End-to-end generation sweeps

[Tables 8](https://arxiv.org/html/2605.07013#A6.T8) and [9](https://arxiv.org/html/2605.07013#A6.T9) show generation-time scaling at 128 NFEs for batches $B \geq 16$. Throughput is measured in completed generated semantic tokens per second and includes the full denoising trajectory. The bitstream advantage grows with batch size because the token-space model repeatedly materializes $B \times T \times V$ logits at every denoising evaluation, whereas Bitstream Diffusion materializes only $B \times T \times \lceil\log_2 V\rceil$ bit logits. The effect is especially pronounced on OWT, where $T = 1024$ and $V = 65{,}536$.

Table 8: Generation efficiency on LM1B at 128 NFEs. Throughput measures completed generated semantic tokens per second and includes the full iterative denoising trajectory. The benchmark excludes decoding, caching, and external evaluation.

Table 9: Generation efficiency on OWT at 128 NFEs. At OWT scale, $T = 1024$ and $V = 65{,}536$, making the vocabulary-wide token boundary substantially more expensive. At batch 64, Bitstream Diffusion uses only 2.94 GiB, while the token baseline already requires 14.49 GiB at batch 16.

| Micro batch | Peak VRAM, Tokens (GiB) | Peak VRAM, Bitstream (GiB) | Mem. ↓ | Generated semantic tok/s, Tokens | Generated semantic tok/s, Bitstream | Speed ↑ |
|---|---|---|---|---|---|---|
| 16 | 14.49 | 0.97 | 15.00× | 2,186 | 5,378 | 2.46× |
| 32 | 28.49 | 1.62 | 17.56× | 2,216 | 5,677 | 2.56× |
| 64 | 56.49 | 2.94 | 19.25× | 2,219 | 5,889 | 2.65× |

The token baseline triggers Out-Of-Memory (OOM) at the next profiled batch size.
### F.4 Analytic and boundary-only scaling

To isolate the mechanism behind the end-to-end gains, we separately analyze the output/loss boundary. A token-space diffusion model forms $BTV$ logits, whereas Bitstream Diffusion forms $BT\lceil\log_2 V\rceil$ logits, reducing the output tensor by a factor of $V/\lceil\log_2 V\rceil$. [Table 10](https://arxiv.org/html/2605.07013#A6.T10) shows that this factor is $2035\times$ on LM1B, $4096\times$ on OWT, and $7529\times$ for a $128{,}000$-word vocabulary.
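The reduction factors follow directly from the vocabulary sizes; a two-line check:

```python
import math

# Output/loss boundary reduction factor V / ceil(log2 V) (cf. Table 10).
for V in (30_522, 65_536, 128_000):
    m = math.ceil(math.log2(V))
    print(f"V={V}: m={m}, reduction {V / m:.0f}x")
# -> 2035x (LM1B), 4096x (OWT), 7529x (128k-word vocabulary)
```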

We also profile the boundary module alone by feeding synthetic trunk activations into either a dense vocabulary head with cross-entropy or the compact bitstream head with a bitwise loss. As shown in [Table 11](https://arxiv.org/html/2605.07013#A6.T11), the isolated boundary cost grows rapidly with $T$ and $V$. At OWT scale, the token boundary requires $7.74$ GiB and $50.95$ ms per step, compared with $0.85$ GiB and $3.26$ ms for the bitstream boundary. In the long-context, large-vocabulary setting, the isolated bitstream boundary is $18.1\times$ smaller in memory and $81.0\times$ faster. End-to-end gains are smaller because the full model also includes the shared Transformer trunk.

Table 10: Analytic vocabulary-boundary tensor sizes. While the Transformer states are shared, the output/loss tensor shrinks from $BTV$ to $BT\lceil\log_2 V\rceil$ in Bitstream Diffusion.

Table 11: Isolated boundary-only profiling ($B = 16$). This benchmark removes the Transformer trunk and isolates the cost of the vocabulary head and sequence loss computation.
### F.5 Limitations

These profiling results measure systems cost, not sample quality. They use synthetic states and random targets to isolate architectural scaling. The token baseline is the most direct continuous token-boundary comparison; embedding-space or discrete-token diffusion variants may differ in input-state cost, loss implementation, and kernel efficiency. However, methods that train through vocabulary-wide logits, masked-token scores, or transition scores retain an $\mathcal{O}(BTV)$ output/loss boundary. Bitstream Diffusion removes this boundary while preserving the semantic Transformer sequence length $T$.

## Appendix G Qualitative evaluation

Table 12:16 randomly selected generated samples from the LM1B checkpoint\.Generated Samplebaker ’ s three \- run homer gave houston a four \- game lead for the sixth time in six games\. \[CLS\] \( cnn \) \- \- two women have been arrested on suspicion of aggravated murder and raping a 15 \- year \- old california girl after she was fatally stabbed to death by her 18 \- year \- old brother peggy meyer\. \[CLS\] moscow \( ap \) \- russia on sunday criticized the european union yesterday and appealed for failing to broker a peace deal with separatists in south ossetia, georgia, the impoverished regions of abkhazia and separatist regions\. \[CLS\] " but we ’ re very confident, for one reason aat times\. \* we welcome comments that advance the story directly or with relevant tangential information\. \[CLS\] crittenton has long announced a 92 % increase in property sales at the olympics, the closing of a stadium in madison square garden and the 27, 000 \- square \- foot swimming pool in the bronx\. \[CLS\] kuerate said he had no intention of commenting on ticket sales\. \[CLS\] the metropolitan police investigation department is investigating the case of a teenager who opened an inquiry into the death of her 18 \- year \- old son, who was murdered in march\. \[CLS\] lavrov privately described the threat as " unacceptable " and called for a " unitednorth korea on march 27 imposed three rounds of sanctions against the country ’ s five permanent permanent permanent permanent permanent security council members \- \- china, russia, south korea, russia, china, canada, china, china, russia, japan, south korea, japan and south africa\. \[CLS\] tehran, iran \( ap \) \- \- beijing says it will justify president mahmoud ahmadinejad ’ s pledge to responsibly hold talks next week to end a crackdown on tiananmen square protests, roiling apart concerns that beijing could prop up trade ties amid the economic downturn\. \[CLS\] defending champion djokovic, out in the quarter \- finalss drug makers pfizer corp\. and pfizer inc rose 32 cents, or 7 percent, to $ 2\. 76 after the drugsmaker lowered its profit forecast for the full year, even in line with analysts ’ expectations\. \[CLS\] the irony of this report is that for these reasons, i don ’ t think facts are irresponsible\. \[CLS\] but i think it will stabilize the readers that attends in the months ahead\. \[CLS\] carmelo anthony added 27 points, nine rebounds and six rebounds for the celtics, who are 2 \- 0 in 13 games against kansas city\. \[CLS\] since then, her body has curtailed her, leavingto continue its overdraft program\. \[CLS\] if a credit crunch took place at this rate losses would have succeeded on the other end of the real estate market\. \[CLS\] the yield on the benchmark 10 \- year treasury note, called yield, fell to 3\. 83 percent from 3\. 85 percent late friday\. \[CLS\] biden repeatedly reaffirmed international deadly support for u\. s\. troops deployed near the border with afghanistan and said aid agencies attacked militants on pickups to carry out attacks in south waziristan as part of an operation launched by the violent regime of president pervez musharraf in november by the u\. sroadside bombs\. \[CLS\] hamilton won his fifth race in ferrari following the loss of defending champion kimi raikkonen to number one ferrari kimi raikkonen by eight points\. \[CLS\] washington \( ap \) \- \- fannie and freddie rose by scribbled a $ 1\. 5 trillion bailout of the real estate industry and endorsing tax breaks for homeowners\. 
\[CLS\] 9 / prnewswire / \- \- the boeing board of directors will announce the union ’ s announcement today that uaw chief executive prof\. charles mueller, who has been appointed as vice president of underage operations, is pleased to announce boeing ’ s retirement\. flaunt\. \[CLS\] " it ’ s gay in tokyo\. " \[CLS\] the attacks on beirut were led by ali saleh, the leader of the niger movement, the sunni group responsible for the assassination of assassinated former prime minister saad hariri in 2005\. \[CLS\] volvo, toyota and russian automakers are pressing ahead with gm plans to purchase smaller sport utility vehicles by the end of this year, with the aim of washing business \- level ones to the japanese market in july\. \[CLS\] harare, zimbabwe \( ap \) \- \- zimbabwe ’ s electoral commission says president hamid karzai is committed to starting a runoff result in elections, accordingpercent, to 943\. 79, and the nasdaq composite index fell 67\. 62, or 0\. 12 percent, to 2, 113\. 26\. \[CLS\] the tel aviv health department identified the deaths of at least nine people homeless after the 7\. 7 \- magnitude earthquake rocked its temblor about 4 : 30 p\. m\. \[CLS\] although the international monetary fund looks like a service for the imf and the association would readily apologise for this conservanor, it would be hard to pay off while delivering the unoisive thirst with the i \- b \- k \- j \- m \- o \- kfighters\. \[CLS\] the state gas group, the brazilian natural gas and supply chain, is expected to announce that it is engaged in talks with its former rival, state \- controlled gas giant gazprom, about the € 5\. 4bn investment offer rejected by german chancellor angela merkel because it hoped the upcoming provincial elections would bolster confidence in the opposition social democratic party, the right \- wing business and liberal democrats\. \[CLS\] the televised meeting appeared seen as china ’ s top two ministers undermine diplomatic ties with beijing by negotiating with washington, and some chinese officials said they hoped such a move would boost ties with the us\. \[CLS\] those fish mayjuly 2008\. \[CLS\] paulson ordinary the housebuilder if in far good faith to make financial woes unwelcomede out of the bailout to always jeopardize the struggling economy\. \[CLS\] a public hearing is scheduled this week to decide whether councillors can take disciplinary action\. \[CLS\] a lorry driver died following an accident on a lane in londonderry early on saturday\. \[CLS\] they paved the way for the far \- reaching target of reducing its greenhouse gas emissions by 60 percent by 2050\. \[CLS\] ann arbor, mich\., july 30 / prnewswire \- usnewswire / \- \- sept\.\. \[CLS\] recent court papers show that northern rock has worked unnecessarily with the haringey council\. \[CLS\] dallas \( ap \) \- jamal guerrero hit a two \- run homer and bobby abreu hit a three \- run homer, as the tampa bay rays ended a five \- game losing streak with a 8 \- 4 victory over the yankees on saturday night\. \[CLS\] sonia sotomayor ’ s lawyer, supreme court justice marcus lambert qc, said the senator goes closer to losing her race for the gop nomination\. \[CLS\] next article in sports \( 16 of 49 \) » a version of this article appeared in print on january 9,, ups, and his son, objectives\. \[CLS\] in midmorning trading, the dow jones industrial average fell 17\. 77, or 0\. 59 percent, while the standard & poor ’ s 500 index added 0\. 16 percent, and the nasdaq composite index rose 25\. 41, or 0\. 
29 percent, to 1, 808\. 24\. \[CLS\] today, carolyn villay, apple ’ s president and chief executive officer, was to attend a seminar on the memorial day for orthopaedic doctors president and chief technology officer jared m\. loughner and several senior medical experts\. \[CLS\] among the dead was atype 2 diabetes, is one of the most vulnerable people in the world to pomegranate during the spreadsheets\. \[CLS\] the plant, which has reserves of 3m tonnes and 2\. 5m tonnes of electricity per year, is currently at an average of 1, 000 megawatts \( 200, 000 tonnes \) a year\. \[CLS\] while united nations director general sarath fonseka and turkish president abdullah abdullah prepare a visit to the capital today, afghan political leaders are lobbying the u\. n\. \- backed coalition, led by karzai and president abdullah abdullah, for an agreement requiring president george w\. bush to restruhe earned a bachelor ’ s degree in public health from harvard business school, a bachelor ’ s in psychology, a bachelor ’ s from harvard university school of business in new york and a bachelor ’ s degree in psychology\. \[CLS\] inbev yesterday took reference to anheuser \- busch ’ s bid, after a pyrotechnic investor agreed to rescheduled the world ’ s largest cellphone maker\. \[CLS\] moscow, july 26 \( upi \) \- \- russian president viktor yanukovych is escalating a cease \- fire to destabilize ukraine and defend georgia, insisting it wants to communicate with georgia\. \[CLS\]chief executive officer of the fa, jim ogden, said that the deal was " unbrieldy\. " \[CLS\] nobody has been told to come forward or to apologise to anyone who witnessed the theft that meant him or her mother\. \[CLS\] a police spokeswoman said the driver of the malibu, who has not been identified, was no longer in critical condition but failed to comply with the order\. \[CLS\] pc david scott, 18, from dagenyrshire, was due to appear before magistrates at the town ’ s bristol magistrates court later\. \[CLS\] the defendant is due to appear as witnesses later\. \[CLS\] it ’ s no surprisethe president ’ s covers, cnn reported sunday\. \[CLS\] the four \- time pro bowl selection crucially as the trojans ’ assistant offensive coordinator before the season \- ending n\. f\. l\. season, mike hanlon, came off the bench\. \[CLS\] the two security officials spoke on condition of anonymity because they were not authorized to release information to the media\. \[CLS\] liverpool : rooney, denilson, carragher, giggs, terry, lampard ; rooney, malouda, essien, rooney, gerrard, mccormack, carrick, malouda, giggs\. \[CLS\] davydenkoTable 12:16 randomly selected generated samples from the LM1B dataset\. \(Continued\)Table 13:4 randomly selected generated samples from the OWT dataset\. Sampling configuration: NFE=256,schurn=33\.15s\_\{\\mathrm\{churn\}\}=33\.15,γ=0\.13\\gamma=0\.13,snoise=1\.003s\_\{\\mathrm\{noise\}\}=1\.003, entropy window\[0,1\]\[0,1\],σterm=0\.0794\\sigma\_\{\\mathrm\{term\}\}=0\.0794, ATIη=0\\eta=0,\(GenPPL,entropy\)=\(27\.06,5\.26\)\.\(\\text\{GenPPL\},\\text\{entropy\}\)=\(27\.06,5\.26\)\.Generated Samplewhich said it would prevent a "chart" changes in law\.We’ll tell you what’s true\. You can form your own view\. 
From 15p EUR 0\.18 $0\.18 $0\.27 a day, more exclusives, analysis and extras\.The move came after Greece invested more than $400 million over its EU’s six year\-old euro, citing the controversial living protections that prevented "goingly" the countryelfra Broadla Smith, chief executive of the International Poverty Law Center\.We’ll tell you what’s true\. You can form your own view\.At The Independent, no one tells us what to write\. That’s why, in an era of political lies and Brexit bias, more readers are turning to an independent source\. Subscribe from just 15p a day for extra exclusives, events and ebooks – all with no ads\.Subscribe now<\|endoftext\|\>0 demonstrators\.Shape Created with Sketch\. Labour activists wearing the Turkish flag in Ankara Show all 6 left Created with Sketch\. Shape Created with Sketch\. Labour activists wearing the Turkish flag in Ankara 1/9/10 1/12 RSS clashes from Turkey were drawn byuddin Sallling, after the demonstrators march at WHO Street taken byRich Brotherhood President Assad\.Image copyright Getty Images 4/10 RSS march at WHO Street taken byRich Brotherhood President Assad Abdul Burmi \- 5/10 RSS march at WHO Street taken byRich Brotherhood President Assad Abdul Burmi \- 7/10uddin Sallling, after the demonstrators march at WHO Street taken byRich Brotherhood President Assad Abdul Burmi from Turkey were drawn by leaderuddin Sallling, after the demonstrators march at WHO Street taken byRich Brotherhood President Assad Abdul Burmi \- 9/10 RSS clashes from Turkey were drawn by leaderuddin Sallling, after the RSS march at WHO Street taken byRich Brotherhood President Assad\.Image copyright Getty Images 10/10 RSS march at WHO Street taken byRich Brotherhood President Assad\.Image copyright Getty Images 1/12 leaderuddin Sallling, after the demonstrators march at WHO Street taken byRich Brotherhood President Assad Abdul Burmi \- 11/10Dear fathers, As you can imagine, more people are reading The Jerusalem Post than ever before\. Nevertheless, traditional business models are no longer sustainable and high\-quality publications, like ours, are being forced to look for new ways to keep going\. Unlike many other news organizations, we have not put up a paywall\. We want to keep our journalism open and accessible and be able to keep providing you with news and analysis from the frontlines of Israel, the Middle East and the Jewish World\.Egypt’s prime minister, Majambeatil Abdul Abdjad, announced he would visit the backing ofuddin Sallling and, his vice\-president, also had been to go in office prior to the election\.Monday’s once\-bearing attacks have occurred just days after Al Qaeda led its in a series of protests over the time of Thursday’s demonstrations in Palestine, anticipating 2,500,000 people after three weeks of bilateral protest\.Shape Created with Sketch\. Shape Created with Sketch\. The Israeli embassy showed the murder of San Ahmad Mubarak 10 left Created with Sketch\. right Created with Sketch\. Shape Created with Sketch\. The Israeli embassy showed the murder of San Ahmad Mubarak in Seattle, Tuesday, April 24, 2011\. REUTERS/Chank Klail/UPI/BIDOWHUREDERS \(K\. Cohen/Gary Webbckman\) on Friday night before protests in parts of Palestine, Calif\., the clash between the Israeli Foreign Minister Bashar al\-Assad and the Israeli prime minister over the murder of San Ahmad Mubarak in Seattle, Tuesday, April 24, 2011\. REUTERS/Chank Klail/UPI/BIDOWHUREDERS \(K\. 
Cohen/Gary Webbckman\) on Friday night before protests in parts of Palestine, Calif\., the clash between the Israeli Foreign Minister Bashar al\-Assad and the Israeli prime minister over the murder of San Ahmad Mubarak and the Israeli prime minister over the murder of San Ahmad Mubarak in Seattle, Tuesday, April 24, 2011\. REUTERS/Chank Klail/UPI/BIDOWHUREDERS \(K\. Cohen/Gary Webbckman\)"It marks the strongest economic and ethnic over the last two years," Jibh Shah, a spokesman for the Hamas group in northern Israel, said in a statement on Friday night before protests in parts of Palestine, Calif\., the escalation of the receiver investment market, rising spending on residentialhouses\."With changes in capital gaining income, the Israeli government is continuing\. Users who will sign the federal nameWith an older senior being issued the federal name on Feb\. 30 \(main end of March\), Users will immediately update the federal name in an March 22 procedure session\. Users graduate on Feb\. 30 \(main end of March\) and an older senior has issued the federal name in an March 22 procedure session\. Users will immediately update the federal nameThis is the first "hirtating" event\. This post will contain all the potential kind of parties who will question twenty stewards\!1 :3\. South America from East American Draft: 2017 05:80 2018–2008\-17 04:38 pm PDTSouth Florida players return to MSU by starting Houston \(GLAA\), and they’ve been very familiar with taking this position\. South Florida, Virginia, Houston and Los Angeles will now want to stay out of holiday of May 1, 2019\! There is a complete selection of new front office veterans in anticipation of 2018\!On May 1, 2019, both players could get together with the qualifier all year 2017\!There is a complete selection of new front office veterans in anticipation of 2018\!On May 1, 2019, both players could get together with the qualifier all year 2017\! 2 :3\. South America from East American Draft: 2017 05:80 2018–2008\-17 04:38 pm PDTPlayers return to MSU by starting Houston \(GLAA\), and they’ve been very familiar with taking this position\.The Miami Buccaneers will be officially assigned to Kansas City in the straight batting finals of 2016\-16 MSU Saturday, Aug\. 3, 2018\.The Miami Buccaneers will be officially assigned to Kansas City in the straight batting finals finals of 2016\-16 MSU 3 : \.The NFL Draft isn’t set \(player terms are permitted\); a new set of starting stars will be scheduled to Milwaukee from MSU Sunday, Aug\. 3, 2018\.\.The NFL Draft isn’t set \(player terms are permitted\); a new set of starting stars will be scheduled to Milwaukee from Thursday Aug\. 3, 2018\.<\|endoftext\|\>This story was updated to become a senior scientist in physics and biology at the University of California, Berkeley\.Read next:"Uulu Goldkallic is "the founder of a deepest\-most\-infused form of sclerosis?" Goldkallic has also been appointed as the lead author for last month’s Journal of Medical Archaeical Scientific\.Dr\. Nevsson, an archaeologist student at Northwestern College of Medical Archaeical, said the new research is recently published by the journal Health Reciv Farm and Feb\. 
9 journal Nature Archaeical\."When this research suggests that reduced spinal nerves can be associated with reduced phenectinal symptoms predictability corrupted elsewhere, or the combination of reduced cancer and phenectinal symptoms obstruct the effects of the counterpototiens culture," he said\.By realizing increased cancer impairability and predictability and increased cognitive impairability, researchers have adopted conclusions about whether products such as antibiotics could potentially function to handstrike the cancer or result that they are able to obstruct brain impairability and predictability\.Mr Nevsson, co\-authors and associate in neurophysivism at the National University of California, Public Health at the University of Berkeley said: "This preliminary work has been handled to power our exposure in human culture and the scientific understanding of what our brain functions are, including different kinds of physical functions and behavioral disappointments\."<\|endoftext\|\>Image copyright AP Image caption The virus ignited in Australian low\-conceived populations when it left died on jampary two months agoA Canadian scientist has learned about the discovery of the virus by former met speculationologist Isaiah Gorel Mythley told Radio 4 Radio\.The researchers think the virus could still spell an devastating process found in bacteria that have spun on jampary and the national wildfire\. The virus has been infected\.Chery Agyler, 61, a non\-supophysologist at Victoria, said he has no reason to man "its" evidence\."Fasassing this discovery is a move which only knows and keeps its arguments," he added\."There is an logical exchange between large parts of Australian populations which only carry places where they tend to access and prevent catastrophic bacteria," said Agyson\.He says the toxic virus such as the function needed to preempt the Australian low\-conceived populationsScientists believe the grossly outwarded virus was "daved by citizens into the wildfire as a damaging and difficult disease"\.’A British kendo passages’Image caption Science Australia is reviewing a complaint against bacteria that occurred in the UK and Northern Japan, the researchers sayThe intense discovery of the bacteria is found in more than 100,000 British kendo passages and other populations with only a network of chunks \(from up to 200,000 km or so\), Agyson said\.The virus is published in the journal Physical Sciencewhen it wants residents to make its way towards the corner of building\.Scoc partico announced in February that Seattle’s decision to launch that building will allow the company "to fulfill its policy of employing local communities\."While it allows residents to purchase roads in every area last week, Guoc partico has already urged the city to stand with the construction of 202nd Street, Twin United said to help engage the next owners of the building\.Business for greatness Read moreIn early 2015, Guoc partico signed an agreement with the city for a housing kit on 202nd Street\.Scoc partico also reached out to the city to operate the chemical algae in the city in the new kit, adding that he has not yet maintained a educational union\."Industry contributions can be held\. 
The downside is really many residents feel in the field," Twin United said\.White House chairman Seth Evans, but Guoc partico said he hopes the district will pull the brink of diarrhea while expanding the area\."Young residents with mutual rights will help less and improve atrocities than due to resistance or flashors across the country," Evans said\. "This is important that we try to help create fathers with business and living in community based on realities that resonate with local politics\."<\|endoftext\|\>Sen\. inducingicon Valley: Senators urge Trump to bar Huawei products from electric electric grid \| Ex\-security officials Trump emergency declaration \| New malicious cyber tool found \| Facebook faces questions on treatment of moderators moderatorsPhotos: Photos: former US Bre crane Ministry of RichmondRC\-23 supporters appear during a protest at thestownard Street in Fairfax Station on Sunday, October 26\. Hide Caption 6 of 39 Photos: Photos: former US Bre crane Ministry of RichmondRC\-23 Prime Minister Rick Sware speaks to journalists at thestownard Street in Fairfax Station on Sunday, July 26\. Hide Caption 7 of 39 Photos: Photos: former US Bre crane Ministry of RichmondRC\-23 supporters during protest besidestownard Street\.Photos: Photos: former US Bre crane Ministry of RichmondRC\-23 supporters Prime Minister Rick Sware speaks to journalists at thestownard Street in Fairfax Station on Sunday, July 26\. Hide Caption 9 of 39 Photos: Photos: former US Bre crane Ministry of RichmondRC\-23 at Richmond International Airport Sunday, October 25\. Hide Caption 10 of 39 Photos: Photos: former US Bre crane Ministry of RichmondRC\-23 International Airport Sunday, October 25\. Hide Caption 11 of 39 Photos: Photos: former US Bre crane Ministry of RichmondRC\-23\. Hide Caption 12 of 39 Photos: Photos: former US Bre crane Ministry of RichmondRC\-23 Prime Minister Rick Sware \(51\) speaks to journalists at thestownard Street in Fairfax Station on Sunday, October 26\. Hide Caption 13 of 39 Photos: Photos: former US Bre crane Ministry of RichmondRC\-23 supporters during a protest besidestownard Street in Fairfax Station on Sunday, October 26\. Hide Caption 14 of 39 Photos: former US Bre crane Ministry of RichmondRC\-23 supporters protest besidestownard Street in Fairfax Station on Sunday, October 26\. Hide Caption 15 of 39 Photos: Photos: Photos: former US Bre crane Ministry of RichmondRC\-23 besidestownard Street in Fairfax Station on Sunday, October 26\. Hide Caption 16 of 39 Photos: Photos: former US Bre crane Ministry of RichmondRC\-23 International Airport Sunday, October 25 \| AP AFP<\|endoftext\|\>Ready to fight back? Sign up for Take Action Now and get three actions in your inbox every week\. You will receive occasional promotional offers for programs that support The Nation’s journalism\. You can read our Privacy Policy here\. Sign up for Take Action Now and get three actions in your inbox every week\.Thank you for signing up\. For more from The Nation, check out our latest issueSubscribe now for as little as $2 a month\!Support Progressive Journalism The Nation is reader supported: Chip in $10 or more to help us continue to write about the issues that matter\. The Nation is reader supported: Chip in $10 or more to help us continue to write about the issues that matter\.Fight Back\! Sign up for Take Action Now and we’ll send you three meaningful actions you can take each week\. You will receive occasional promotional offers for programs that support The Nation’s journalism\. 
You can read our Privacy Policy here\. Sign up for Take Action Now and we’ll send you three meaningful actions you can take each week\.Thank you for signing up\. For more from The Nation, check out our latest issueTravel With The Nation Be the first to hear about Nation Travels destinations, and explore the world with kindred spirits\. Be the first to hear about Nation Travels destinations, and explore the world with kindred spirits\.Sign up for our Wine Club today\. Did you know you can support The Nation by drinking wine"Dear Reader, As you can imagine, more people are reading The Jerusalem Post than ever before\. Nevertheless, traditional business models are no longer sustainable and high\-quality publications, like ours, are being forced to look for new ways to keep going\. Unlike many other news organizations, we have not put up a little news\."SCROLL TO CONTINUE WITH CONTENT Help Keep Common Dreams Alive Our progressive news model only survives if those informed and inspired by this work support our effortsAdvertisements<\|endoftext\|\>President Hillary Clinton Hillary Diane Rodham ClintonREAD: Cohen testimony alleges Trump knew Stone talked with WikiLeaks about DNC emails County GOP in Minnesota shares image comparing Sanders to Hitler Holder: ’Time to make the Electoral College a vestige of the past’ MORE and Bernie Sanders Bernard \(Bernie\) SandersPush to end U\.S\. support for Saudi war hits Senate setback Sanders: ’I fully expect’ fair treatment by DNC in 2020 after ’not quite even handed’ 2016 primary Sanders: ’Damn right’ I’ll make the large pay ’fair share of share of taxes’ Former MORE \(I\-Vt\.\) has also regularly expressed concern about his inability to stay in Syria to cut the deficit transition\.At Monday of The Hill’s Morning, Sen\. Dianne FeinsteinHillicon Valley: Senators urge Trump to bar Huawei products from electric grid \| Ex\-security officials condemn condemn Trump emergency declaration \| New malicious cyber tool found Facebook faces faces questions treatment of moderators Key senators say administration should ban Huawei from tech in US electric grid Cots unflakably in effect after Politico reports Five reported MORE \(D\-N\.Y\.\)\. and Rep\. Charles SchumerJoe \(Mitch\) SchumerJoe TrumpWhite Senate GOP on emergency declaration Houseads with Senate GOP on emergency declaration Senate Report against Trump The Hill’s 12:30 Report: Sanders set to shake up 2020 2020 MORE \(R\-Wis\.\) said in an interview Thursday\. "If you just start coming down and doing it, it’s not a very small part of the political process\."The Democrats have been occupational to defend Congress as "right" and some Republicans view them as the Republican nominee, according to a new Poll poll\.ADVERTISEMENTDemocrats, including Senate Majority Leader Harry Reid Addison James \(Jack\) Trump The Hill’s Morning Report: Sanders set to shake up fight against Trump The Hill’s Morning Report: Sanders set to shake up fight over against Trump The Hill’s 12:30 Report: Sanders set to shake up 2020 2020 MORE \(R\-Ky\.\) said Tuesday in an email Thursday\. It’s hard to confirm why the statements are accurate\.ADVERTISEMENTSen\. 
Mitch McConnell Davis Paul PaulBrexit and exit: A transatlantic comparison Five Five takeaways from McCabe’s allegations against Trump The Hill’s’s 12:30 Report: Sanders set to shake up ahead of 2020 MORE \(R\-Ky\.\) Majority Leader Mitch McConnell James \(Bernie\) RyanBrexit and exit: A transatlantic comparison pen Five takeaways from McCabe’s Trump The Hill’s 12:30 Report: Sanders set to shake up 2020 2020 MORE \(R\-Ky\.\) Sen\. Paul Paul Paul PaulBrexit and exit: A transatlantic takeaways from McCabe’s allegations against Trump The Hill’s 12:30 Report: Sanders set to shake up 2020 2020 2020 MORE \(R\-Ky\.\) said after the start of the economic year, a major trade deal has been negotiated\.The Foreign Ministry confirmed Thursday off their announcement from an engaged source telling the state whether the government will end up using the international force to shut out the veto of the world’s trade talks\.The British newspaper has said that after the start of the economic year, a major trade deal has been negotiated\.It is off their announcement from an engaged source telling the state whether the government will end up using the international force to shut out the veto of the world’s trade talks\.The Foreign Ministry said in a statement Thursday that the UK will integrate a deal into a temporary dispute and could discuss partners in the UK\.We’ll tell you what’s true\. You can form your own view\. From 15p EUR 0\.18 $0\.18 $0\.27 a day, more exclusives, analysis and extras\.The European Commission will call for a treaty with the European Union to help resist both the trade deals and the Trans\-Pacific Partnership \(IPA\) agreement with UK leaders\.France was "delakbing" at the exact conclusion of an ongoing conflict with the senior countries in 2013\.The Foreign Ministry said at a briefing that the deal will consider the conclusion of an agreement which forces the UK to resist its deals between the two countries\.Speaking earlier Thursday, the Foreign Ministry said it would not discuss if these

Table 13: 4 randomly selected generated samples from the OWT dataset. Sampling configuration: NFE $=256$, $s_{\mathrm{churn}}=33.15$, $\gamma=0.13$, $s_{\mathrm{noise}}=1.003$, entropy window $[0,1]$, $\sigma_{\mathrm{term}}=0.0794$, ATI $\eta=0$. (Continued)

Table 14: 4 randomly selected generated samples from the OWT dataset. Sampling configuration: NFE $=256$, $s_{\mathrm{churn}}=45.9$, $\gamma=0.18$, $s_{\mathrm{noise}}=1.003$, entropy window $[0,1]$, $\sigma_{\mathrm{term}}=0.0794$, ATI $\eta=0$, $(\mathrm{GenPPL}, \mathrm{entropy})=(34.83, 5.32)$.

Generated Sample:

would be added to legal authority\."In practice, it should be the meaningful relief, any of \[the Special Capitalism Government\] before prodion would be changed," Shahur said\.He continued that the PM refused to run the PM but granted plans to prodion tell his Congress\."The questioning of prodion tell his Congress that I do not know how to make entitlements and are subject to those required to file borrowed," Shahur said\."These changes would hardly involve holding the Special Capitalism Government borrowed from rules that are enacted by Parliament," she said\.Maur said changes cannot be debated by KOIS in Samamik Bihar, Andures Nadu and Karnataka\.Firstpost is now on WhatsApp\. For the latest analysis, commentary and news updates, sign up for our WhatsApp services\. 
Just go to Firstpost\.com/Whatsapp and hit the Subscribe button\.<\|endoftext\|\>The Denver Browns will be eating home for NFL coaches and football teams early Friday\. A NFL official at the White House said the Browns will be able to grant NFL coaches to sign in the NFL and suggest teams will be once a week carrying ammunition including ammunition, weapons and guns\.NEWSLETTERS Get the Gn Wgs newsletter delivered to your inbox We’re sorry, but something went wrong Get the pulse of Arizona – Local news, in\-depth state state writers and what\. \( means, Ch to your inbox\. Please try again soon, or contact Customer Service at 1\-888\-424\-022527\. Delivery: Varies Invalid email address Thank you\! You’re almost signed up for Wgs Keep an eye out for an email to confirm your newsletter registration\. More newslettersAnd Friday, the Denver Browns said it will not extend new arrangements for the coaches to sign in the NFL while playing the best ends backs\.Joe Stanal may be reached at…al@globe\.com \. Follow him @jalachamal, or on Facebook on Facebook\.<\|endoftext\|\>Carlos Democratic nominee Donald Trump, get on a GOP presidential challenger, drew to criticism over support for the bid on Wednesday to take re\-election in April as his Republican chairman and the presumptive Democratic board of directors\.The New York GOP nomination is scheduled next week to meet Republican nominee’s 3rd side on the Saturday Night Live tour and watch the GOP’s first\-ever sitcom TV movie and will reportedly be portrayed as fastest for "goldors" for presumptive elections on the Sunday Night Live starting on Wednesday and Tuesday\.We’ll tell you what’s true\. You can form your own view\. From 15p EUR 0\.18 $0\.18 $0\.27 a day, more exclusives, analysis and: ’I fully expect’ fair treatment by DNC in 2020 after ’not quite even handed’ 2016 primary Sanders: ’Damn right’ I’ll make the large corporations pay ’fair share of taxes’ MORE \(I\-Vt\.\), the candidate for the presumptive Democratic nomination and presumptive Republican presidential nominee Donald Trump, was a brother of former Democratic nominee\-elect Donald Trump Donald John TrumpREAD: Cohen testimony alleges Trump knew Stone talked with WikiLeaks about DNC emails Trump urges North Korea to denuclearize ahead of summit Venezuela’s Maduro says he fears ’bad’ people around Trump MORE, former Republican presidential nominee Donald Trump Donald John TrumpREAD: Cohen testimony alleges Trump knew Stone talked with WikiLeaks about DNC emails Trump urges North Korea to denuclearize ahead of summit Venezuela’s Maduro says he fears ’bad’ people around Trump MORE \(I\-Vt\.\), who is organizing Obama’s bid for the presumptive Democratic National Committee\.ADVERTISEMENTThe GOP nomination is scheduled to take re\-election in April when he signaled Republican presidential nominee Hillary Clinton Hillary Diane Rodham ClintonSanders: ’I fully expect’ fair treatment by DNC in 2020 after ’not quite even handed’ 2016 primary Sanders: ’Damn right’ I’ll make the large corporations pay ’fair share of taxes’ Former Sanders campaign spokesman: Clinton staff are ’biggest a–holes in American politics’ MORE and Vermont Sen\. Bernie Sanders Bernard Bernard \(Bernie\) Sanders to end U\.S\. 
support for Saudi war hits Senate setback Sanders: ’I fully expect’ fair treatment by DNC in 2020 after ’not quite even handed’ 2016 primary primary Sanders: ’Damn’ I’ll make the large corporations pay ’fair share of taxes’ Former MORE \(I\-Vt\.\) on Wednesday\.ADVERTISEMENTThe New York GOP nomination was originally on the Sunday Night Live starting Friday and Tuesday\.We’ll tell you what’s true\. You can form your own view\.At The Independent, no one tells us what to write\. That’s why, in an era of political lies and Brexit bias, more readers are turning to an independent source\. Subscribe from just 15p a day for extra exclusives, events and ebooks – all with no ads\.Subscribe nowcamping, wearing galay pabbs\."Senior speaker James Simque, a spokesman for the Catholic evangelical Church of Jesus Christ of Latter\-day Saints, characterized the "performance" as a tragedy\."It’s a situation\. It’s the tragedy that is necessary," he said\."I’m trying to put them back on that process\."Dunque said Lanett was never immediately involved\. An incident happened at 5:45 p\.m\. after shooting at firefighters during a protest at a community fire tower in central Portland\. Lanett was responsible for the massacre but suffered numerous mental health issues with no alleged recourse or reason for anyone alert\. One avalanche involved four arrests and one in a spinning residence near Archeoth Street and Cager St Englando in a spinning residence near Archeoth Street\.The tsunami at 200 corner of Archeoth Street where hospital erupted at about 4:55:30 a\.m\., Sabden Bell, a nurse, was shot near the 540st Temple of Redlake St\.On a deadly scene at 6:05 a\.m\., local car owner Charlie Taffi said the guns were fired at him and the car quickly slipped inside\.Sherden Taffi said all of the men grabbed a pool in the street but spotted Bell in the more while she shot out a gun when she saw a weapon\."I didn’t know what to do," he said\. "When I was cooking I felt like I was marching in a few times but I couldn’t sleep too\. I couldn’t do a gun\. I think made her exit for good\."Masteri said Lanett was covered with a "wushling eye pistol\.""there’s a assembled meal\. It’s a crime at stake as such," she said\.Her session party will be held at the Archeoth Polating College of Dader\-Aallinian University near Hue of the Catholic Catholic Church to date Hospital\.Anyone with information is on board situation could contact Portland State Police to call her at 307\-282\-5232\. 
Email her at jlurt@freep\.netFollow her on Twitter: @barmlurt<\|endoftext\|\>Looking for news you can trust?Subscribe to our free newsletters\.The former front\-runner reported that Donald Trump will ask former top strategist Michael Flynn to leave the investigation of the possible ties between the Republican nominee Donald Trump during a plan to keep up the FBI office in the 2016 election, according to Kansas Democratic National Convention chairman Tim Gillansson\.Trump previously confirmed a new plan to cultivate throughout the Republican\-led investigation\."The tool responsible for the investigation is not a second motive," Gillansson told me on Sunday night\.The announcement came just a few one day before Trump urged FBI Director James Comey and FBI director Michael Flynn to move forward, unless mandated\."There are few reasons for them to help this investigation," Trump said Wednesday leading through an exclusive interview with the DOJ\.Kumann said she also criticized Trump for investigating allegations, saying that the issue "is most improbable\."Donald Trump has appointed frequent U\.S\. Sen\. Ted Cruz, who has recently been elected as a top policy adviser to presumptive Republican presidential nominee Donald Trump, and former New Jersey Sen\. Lindsey Graham \(R\-Ill\.\), Secret Service chair Nathan Brave, who resigned from the Obama administration at the time\.Obama has also urged Senate Republicans to extend a plan to ask Comey to figure out the possible ties between the Republican nominee Hillary Clinton\.Following a news conference on Tuesday, Gillansson answered\."I encourage you to continue meeting with FBI Director Michael Comey not to get a thorough collection process because the process would not have been negotiated," Brave said in a statement led by Wall Street Journal journalist Brad Mullke\."Deputy is thatEverybody else understand that this very comprehensive investigation sparked the controversy\."<\|endoftext\|\>Two Republican representatives of Colin Powell announced the notice of pending charges against U\.S\. Chief of Staff Secretary Clinton, that the preference to lead for foreign policy twice – the head of the administration Department, was that they actually played a role in determining dealings between the president and private "skame organizations of one country, he added\."President Powell sought to update this position from his senior national security adviser\.On Thursday, former former House member Fred Issa said he was responding to his testimony during a meeting before the Senate Appropriations Committee to indicate the public acting as an unidentified FBI officer would presumably cause confusion through the establishment of a federal investigation\.President Powell sought to update this position from his senior national security adviser\.On Thursday, former House member Fred Issa said he was responding to his testimony during a meeting before the Senate Appropriations Committee to indicate the public acting as an unidentified FBI officer would presumably cause confusion through the establishment of a federal investigation…\./7\)<\|endoftext\|\>An Palestinian group that legitimizes Palestinians in the Gaza Strip and the northeastern West Bank slams Mahmoud Mohammed Salah in the West Bank in the east of West Bank\.The shooting took place Wednesday, Aug\. 
12, killing 38 civilians, Palestinian Authority Hamas said, for the latest since the U\.S\.\-Israel coup started\.Israel is presenting a rebuke of public protests against Israel and the Palestinian side of the Gaza Strip in two areas, the Foreign Ministry announced Friday\.The Guardian revealed that pro\-Israel group unveiled a more concerted campaign to attack Palestinians in the West Bank, which is the territory of the northeastern Gaza Strip and northern Gaza\.Shortly after the attack was picked up by an Israeli army commander, who was shot from the left leg and suffering from four Sanctings\.Israel has reported that Jerusalem and one of the major settlements hit a First West Bank area of Bryuboo that killed seven people, the Foreign Ministry said\. The deadly shooting near the city of Jerusalem was another rebuke of public protests against Palestinian people in Gaza\.The Foreign Ministry said the Gaza area was occupied by Israel and its current churches, causing devastating harm and mass atrocities amongst Israeli people in Gaza\.UN officials said that Israel on Saturday demanded a crackdown against Palestinian President Mahmoud Mohammed Salah\.A First West Bank shooting did not address reports\. No arrests were needed due to crimes, and Mohammed Salah did not respond to request for comment\.A government source said the shooting has been secretly captured by Israeli forces\.Israel spokesman Alasser Al Walhha say both victims pose a confrontation between Palestinians and Palestinian President Mahmoud Mohammed Salah\.A UN source did not immediately respond to requests for comment\.Keep updated: Sign up to our newsletter Email \* Please enter a valid email address Sign up Please wait… Thank you for signing up\. We’ve got more newsletters we think you’ll find interesting\. Click here Oops\. Something went wrong\. Please try again later\. Try again Thank you, The email address you have provided is already registered\. Close<\|endoftext\|\>San Antonio City Idaho Marquivul’s Adutio Sajul SajalaAlter’s anch Adutio Sajul Sajala is an international center for health and environmental impacts, according to a report compiled by the San Francisco Register\. "San Antonio City Idaho has developed Marquivul’s anch Adutio Sajul Sajala," the department said\.The risk study presented by the CDC to limit the risk for disease in humans remains "decision" in the past decade and calculates the worst weight memory loss, the department said\.Yillivul’s anch Adutio Sajul Sajala affects patients, according to the report by The San Francisco Register\.The study found an "sasured stall" attached in the risk prevention system to manage the incidence of opioid disease called Marquivul’s anch Adutio Sajul Sajala\."Psychletsri Medical opioid disease because of mammalian brain cancer may actually contain its effects," the department said in a statement\.The decline in levels of opioid disease has emerged from 2010 to 2007\. According to a report, the CDC published its own data from early 1994 to 2009\.More than 2\.5 million injured opioid victims were refugees, according to the CDC from the U\.S\. 
Centers for Disease Control and Prevention\.The report compared the number of people with brain disorders related to a study 2008 through 2009 at the National journal Medical Letters that "drug infections and brain disorders are the same as dementia, a small majority of people do not qualify for the infection of the opioid disease," the CDC said\.Nearly 60,000 refugees are infected with technology and brain therapy and brain respiratory therapy are rare, the CDC said\.In 2010, the number of injured opioid victims visited Idaho, and more than 14,500 refugees have been infected in recent years\.Last week, more than 1\.3 million opioid victims with brain respiratory therapy have been diagnosed asrome opioid drugs\. Some opioid drugs have been naturally associated with infectious diseases including Parkinson’s disease or Parkinson’s disease, the CDC said\. Another myriad virus diseases, such as Alzheimer’s disease or Parkinson’s lung disease, is now often associated with clinical patients with eye deprivation and other health symptoms\.About 8,500 refugees are infected, the CDC said\.More than 2\.5 million opioid victims were infected in regions Creed Idaho, a report concluded\."San Antonio City Idaho has been contracted to quantify the magnitude of the virus," the CDC said\."The CDC website is also part of the study and offers a dialogue to patients with the Hawann Watson County Sheriff’s Office," the report said\.Copyright 2014 by Nexstar Broadcasting\.com\. All rights reserved\. This material may not be published, broadcast, rewritten or redistributedsupport for same\-sex marriage at a presidential debate in San Diego, thereby citing broad laws eliminating marriage\.<\|endoftext\|\>ES News Newsletter Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid You already have an account\. Please log in or register with your social accountLos Angeles Police charged David Duuolo, 37 at 4:50 p\.m\., who allegedly came from his flat after he was charged on charges, including the pair on a chase at 8580 Bruce Ave\., the alleged driver\.Mituolo, 37, was returning his 67\-year\-old son from his apartment in San Antonio, New York at 2:45 a\.m\.\. Photo: Ian Mizzing\)San Francisco Police Department detectives said Wednesday they contacted any investigation for one incident\. Phil County Officer James Mitchell contacted Waron Sheriff Sgt\. Ian Mizzing\. The investigation was over, 41 hours before crash, police said\.Investigators said they noticed an unarmed man hitting anything from his flat at about 5 a\.m\. and accidentally injuring his back patrol vehicle, police have said\.City police said they might see Duuolo be into a flat, breaking his hi into a flat and killing him on her head, said Waron Phil County Attorney Randy Vyjit\.Jeremy Duuolo and his friends were caught near a recovered passenger about 2:50 p\.m west of 633 Bruce Avenue Monday\.About days after the crash an passenger flight crashed from a road at about 8:30 a\.m\. Northern in response to the therefore include sustained muscle damage, police said\.Officials and the Federal Safety Administration have not provided any details\. Vyjit said a police believe that the plane was recovered because Waron officials had been briefed in order to determine whether the plane was recovered\."It’s reasonable to assume that this occurred," Vyjit told Business Insidermedia\.com\."They believe that the plane was recovered\. \. \. \. 
They do believe that the plane was recovered and that the shootings were made before the crash\."The detectives have responded to multiple arrests for the first time\. A witness who has not reviewed the incident on Smateau Street, said the student tried to defend him when it attacked\."I don’t think I don’t don’t want my brother to go home anymore," Vyjit said\.The student allegedly killed her 61\-year\-old son\. In the front car, despite the guns, the student ran at his secret price, then stopped him for one of the bombs, noting\."Ah, I don’t like the student just didn’t want to turn me on the ground, but his son’s student did not want to imprison him," Vyjit said\.Jeremy Duuolo and his friends, whose wife, Martha, went to San Antonio where he never needed to as I as they started cooking\. Tulkabce occurred between Smateau Street and Jordll’s Park in Parkfield, Mo\., Maddery Street and 10th Street in San Antonio, Wash\. Vyjit told Business Insidermedia\.com that the police department is closed investigation\."I believe it is damaging\. We have limited evidence because we are closed for investigation," Vyjit said\. "I think that there was power at all\."Daters and Waron officials are engaged together in this shooting\."I don’t want to realize that he is a person who is typically maximised and presreilled\. I don’t want to realize that he is a bad person who is left him in sheer condition," Atlanta Police have confirmed Monday\.Police are investigating some of the most dangerous people in Atlanta on occasion back because of contemplating and heightened sex violence for 3 weeks\.\(Related More\)<\|endoftext\|\>You can eat your own gtadummyis in a very popular dish: Chocolate and idiosyncres are actually unique food\. They have plenty of candy and some coconut packs, some spinach or with a slightly promulgating pipe bowl that rocked East Wlerheakin\. You won’t eat like the traditional gtadummyis \(tberries for cherry, packs, and wood fuckies\)\. They show a popular dish for under an hour\. Useful bread can be sliced and topped with warm vanilla, such as illustrating your manure\.The premium gtadummyis is becoming also a popular dish because, after home as one of the most productive food restaurants on the list\. There are Form that goes long ago to regret that; they need a lot of ample bread\.Jookien, which is called East Wlerheakin, is to take to the hospital gtadummyis is good, but won’t include idiosyncres uncomfortable with wine powder and a series of juicy bread sauce\. This is an odd texture and six coconut drinks and the bowl works really well using premium quality

Table 14: 4 randomly selected generated samples from the OWT dataset. Sampling configuration: NFE $=256$, $s_{\mathrm{churn}}=45.9$, $\gamma=0.18$, $s_{\mathrm{noise}}=1.003$, entropy window $[0,1]$, $\sigma_{\mathrm{term}}=0.0794$, ATI $\eta=0$. (Continued)
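For readers who want to connect the caption parameters above to the entropy-gated stochastic sampler described in the abstract, the following is a minimal sketch assuming a Karras-style (EDM) churn update in which the per-position churn is additionally scaled by a normalized entropy-rate gate. The hooks `entropy_fn` and `denoise_fn`, the exact gating form, and the way $\gamma$ multiplies the churn schedule are illustrative assumptions, not the authors' released procedure; only the parameter values ($s_{\mathrm{churn}}$, $\gamma$, $s_{\mathrm{noise}}$, the entropy window, NFE) come from the table captions, and the schedule-level parameters $\sigma_{\mathrm{term}}$ and ATI $\eta$ are omitted here.

```python
import torch

def entropy_gated_churn_step(
    x, sigma, sigma_next, denoise_fn, entropy_fn,
    s_churn=45.9, gamma=0.18, s_noise=1.003,
    entropy_window=(0.0, 1.0), num_steps=256,
):
    """One stochastic step: EDM-style churn gated by an entropy-rate profile.

    x:          (batch, seq_len, n_bits) analog bit state (assumed layout).
    entropy_fn: assumed hook returning a per-position entropy estimate
                normalized to [0, 1], shape (batch, seq_len).
    denoise_fn: assumed hook returning the model's denoised estimate of x.
    """
    # Gate: churn only where the normalized entropy falls inside the window,
    # and proportionally to it, so low-information positions stay
    # near-deterministic.
    h = entropy_fn(x, sigma)                              # (B, L)
    lo, hi = entropy_window
    gate = ((h >= lo) & (h <= hi)).float() * h

    # EDM churn factor; with s_churn=45.9 and 256 steps this is ~0.18/step.
    churn = min(s_churn / num_steps, 2 ** 0.5 - 1)

    # Temporarily raise the per-position noise level, then inject matching
    # Gaussian noise (scaled by s_noise, slightly above 1 as in EDM).
    sigma_hat = sigma * (1 + gamma * churn * gate)        # (B, L)
    extra = (sigma_hat ** 2 - sigma ** 2).clamp(min=0).sqrt()
    x = x + torch.randn_like(x) * s_noise * extra.unsqueeze(-1)

    # Deterministic Euler step from sigma_hat down to sigma_next.
    d = (x - denoise_fn(x, sigma_hat)) / sigma_hat.unsqueeze(-1)
    return x + (sigma_next - sigma_hat).unsqueeze(-1) * d
```

Note that $45.9/256 \approx 0.18$, so even the most aggressive configuration in these tables keeps the per-step churn modest and, because of the gate, spends it only on high-entropy positions.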
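The $(\mathrm{GenPPL}, \mathrm{entropy})$ pairs reported in these captions follow the usual DLM sample-quality protocol: generated text is scored under a pretrained autoregressive judge, and diversity is tracked via the empirical token entropy of the generated corpus. The sketch below assumes GPT-2-large as the judge and a unigram entropy estimate; the paper's exact judge model and entropy definition are not shown in this excerpt.

```python
import math
from collections import Counter

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

judge = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()  # assumed judge
tok = GPT2TokenizerFast.from_pretrained("gpt2-large")

@torch.no_grad()
def gen_ppl(samples):
    """Generative perplexity: exp of the judge's token-averaged NLL."""
    total_nll, total_tokens = 0.0, 0
    for text in samples:
        ids = tok(text, return_tensors="pt", truncation=True).input_ids
        # .loss is the mean cross-entropy over the ids.numel() - 1 shifted targets.
        loss = judge(ids, labels=ids).loss
        total_nll += loss.item() * (ids.numel() - 1)
        total_tokens += ids.numel() - 1
    return math.exp(total_nll / total_tokens)

def token_entropy(samples):
    """Empirical unigram entropy (in nats) of the generated corpus."""
    counts = Counter()
    for text in samples:
        counts.update(tok(text).input_ids)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())
```

Under this reading, a sampler configuration sits on the Pareto frontier when it lowers `gen_ppl` without pushing `token_entropy` below the real-data value.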

Similar Articles

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Hugging Face Daily Papers

LangFlow presents the first continuous diffusion language model that rivals discrete diffusion approaches, challenging the long-held belief that continuous diffusion is inferior for language modeling. The work introduces key ingredients like optimal Gumbel-based noise scheduling and demonstrates competitive perplexity and transfer learning performance compared to discrete diffusion baselines.

TextLDM: Language Modeling with Continuous Latent Diffusion

Hugging Face Daily Papers

This paper introduces TextLDM, a method that adapts visual latent diffusion transformers for language modeling by mapping discrete tokens to continuous latents. It demonstrates that this approach, enhanced by representation alignment, matches GPT-2 performance and unifies visual and text generation architectures.

CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language

arXiv cs.CL

CRoCoDiL proposes a continuous and robust conditioned diffusion approach for language that shifts masked diffusion models into a continuous semantic space, achieving superior generation quality and 10x faster sampling speeds compared to discrete methods like LLaDA.