Language Modeling with Hyperspherical Flows

arXiv cs.LG 05/13/26, 04:00 AM Papers
Summary
This paper introduces S-FLM, a novel flow-based language model that operates in a hyperspherical latent space to address the computational costs and semantic limitations of existing discrete diffusion and continuous flow models.
arXiv:2605.11125v1 Announce Type: new Abstract: Discrete Diffusion Language Models progressed rapidly as an alternative to autoregressive (AR) models, motivated by their parallel generation abilities. However, for tractability, discrete diffusion models sample from a factorized distribution, which is less expressive than AR. Recent Flow Language Models (FLMs) apply continuous flows to language, transporting noise to data with a deterministic ODE that avoids factorized sampling. FLMs operate on one-hot vectors whose dimension scales with the vocabulary size, making FLMs costly to train. Moreover, since all distinct one-hot embeddings are equidistant in $\ell_2$, adding Gaussian noise does not have a clear semantic interpretation (unlike images, where Gaussian noise progressively degrades structure). We introduce $\mathbb{S}$-FLM, a latent FLM in the hypersphere. $\mathbb{S}$-FLM generates sequences by rotating vectors in $\mathbb{S}^{d-1}$ along a velocity field learned with cross-entropy, avoiding the overhead of materializing one-hot vectors. Previous FLMs match AR in Generative Perplexity (Gen.\ PPL), but samples with high likelihood are not necessarily correct in verifiable domains such as math and code. $\mathbb{S}$-FLM substantially improves continuous flow language models on large-vocabulary reasoning and closes the gap to masked diffusion under standard-temperature sampling ($T=1$), while a gap remains under optimized low-temperature ($T=0.1$) decoding.
Cached at: 05/13/26, 06:30 AM
# Language Modeling with Hyperspherical Flows
Source: [https://arxiv.org/html/2605.11125](https://arxiv.org/html/2605.11125)
Justin Deschenaux (EPFL, Lausanne, Switzerland; justin.deschenaux@epfl.ch) and Caglar Gulcehre (EPFL, Lausanne, Switzerland; Microsoft AI)

###### Abstract

Discrete Diffusion Language Models progressed rapidly as an alternative to autoregressive (AR) models, motivated by their parallel generation abilities. However, for tractability, discrete diffusion models sample from a factorized distribution, which is less expressive than AR. Recent Flow Language Models (FLMs) apply continuous flows to language, transporting noise to data with a deterministic ODE that avoids factorized sampling. FLMs operate on one-hot vectors whose dimension scales with the vocabulary size, making FLMs costly to train. Moreover, since all distinct one-hot embeddings are equidistant in $\ell_{2}$, adding Gaussian noise does not have a clear semantic interpretation (unlike images, where Gaussian noise progressively degrades structure). We introduce $\mathbb{S}$-FLM, a latent FLM in the hypersphere. $\mathbb{S}$-FLM generates sequences by rotating vectors in $\mathbb{S}^{d-1}$ along a velocity field learned with cross-entropy, avoiding the overhead of materializing one-hot vectors. Previous FLMs match AR in Generative Perplexity (Gen. PPL), but samples with high likelihood are not necessarily correct in verifiable domains such as math and code. $\mathbb{S}$-FLM substantially improves continuous flow language models on large-vocabulary reasoning and closes the gap to masked diffusion under standard-temperature sampling ($T=1$), while a gap remains under optimized low-temperature ($T=0.1$) decoding.

![Refer to caption](https://arxiv.org/html/2605.11125v1/x1.png)
![Refer to caption](https://arxiv.org/html/2605.11125v1/x2.png)

Figure 1: Accuracy on GSM8K at $T=1$. Left: Decoding strategies for $\mathbb{S}$-FLM with the $\mathbb{S}$-arch (Sec. [3.3](https://arxiv.org/html/2605.11125#S3.SS3)). Exact velocity ([15](https://arxiv.org/html/2605.11125#S3.E15)) and stochastic decoding (Algo. [3](https://arxiv.org/html/2605.11125#alg3), *Stoch.*) plateau near $12\%$. Restricting the velocity to the top-$k$ entries of $p^{\theta}_{1|t}$ improves the accuracy, with top-$1$ reaching $\sim 18\%$. Right: $\mathbb{S}$-FLM (with the $\mathbb{S}$-arch) vs. MDLM and Duo. With the exact velocity, $\mathbb{S}$-FLM beats both baselines at $\mathrm{NFE}\leq 16$.

![Refer to caption](https://arxiv.org/html/2605.11125v1/x3.png)

Figure 2: $\mathbb{S}$-FLM overview. Training (top): we embed each token as a unit-norm vector on $\mathbb{S}^{d-1}$. We obtain the noisy latent $\mathbf{z}_{t}^{\ell}$ by SLERP between the clean embedding and a random vector on $\mathbb{S}^{d-1}$. We train the denoiser $p^{\theta}_{1|t}$ with cross-entropy. Sampling (bottom): $p^{\theta}_{1|t}$ defines a velocity field by marginalizing over tangent vectors pointing toward each clean embedding $\hat{\mathbf{e}}_{v}$, $v\in\mathcal{V}$. Starting from uniform noise on $\mathbb{S}^{d-1}$, we integrate along the velocity field and decode the final latent via $\arg\max_{v\in\mathcal{V}}p^{\theta}_{1|1}(v\mid\mathbf{z}_{1})$.

## 1 Introduction

Autoregressive (AR) models currently dominate language modeling. Thanks to the chain-rule factorization and the Transformer architecture ([vaswani2017attentionneed,](https://arxiv.org/html/2605.11125#bib.bib89)), the AR likelihood is fast to evaluate, and AR language models scale to large sizes ([kaplan2020scalinglawsneurallanguage,](https://arxiv.org/html/2605.11125#bib.bib39); [openai2024gpt4technicalreport,](https://arxiv.org/html/2605.11125#bib.bib63); [openai2024gptoss,](https://arxiv.org/html/2605.11125#bib.bib64); [grattafiori2024llama3herdmodels,](https://arxiv.org/html/2605.11125#bib.bib57); [geminiteam2025gemini,](https://arxiv.org/html/2605.11125#bib.bib29); [gemmateam2025gemma3technicalreport,](https://arxiv.org/html/2605.11125#bib.bib31)). However, during sampling, AR models need one forward pass per token, and causal attention can hurt performance on reasoning tasks that require bidirectional context ([papadopoulos2024arrowstimelargelanguage,](https://arxiv.org/html/2605.11125#bib.bib65); [kitouni2024factorizationcursetokenspredict,](https://arxiv.org/html/2605.11125#bib.bib44); [zhangli2024reversenumberdecodingorder,](https://arxiv.org/html/2605.11125#bib.bib98); [nagarajan2025roll,](https://arxiv.org/html/2605.11125#bib.bib60)).

Discrete diffusion models ([austin2023structureddenoisingdiffusionmodels,](https://arxiv.org/html/2605.11125#bib.bib5); [campbell2022continuoustimeframeworkdiscrete,](https://arxiv.org/html/2605.11125#bib.bib9); [sahoo2024simpleeffectivemaskeddiffusion,](https://arxiv.org/html/2605.11125#bib.bib76); [gat2024discreteflowmatching,](https://arxiv.org/html/2605.11125#bib.bib28); [sahoo2025diffusionduality,](https://arxiv.org/html/2605.11125#bib.bib77); [shi2025simplifiedgeneralizedmaskeddiffusion,](https://arxiv.org/html/2605.11125#bib.bib82); [nie2025scalingmaskeddiffusionmodels,](https://arxiv.org/html/2605.11125#bib.bib61); [vonruette2026scalingbehaviordiscretediffusion,](https://arxiv.org/html/2605.11125#bib.bib93); [sahoo2026scalingmaskeddiffusionlanguage,](https://arxiv.org/html/2605.11125#bib.bib78); [wu2025fastdllmtrainingfreeaccelerationdiffusion,](https://arxiv.org/html/2605.11125#bib.bib95)) approach AR models in Generative Perplexity (Gen. PPL), with parallel generation and bidirectional context. However, at each denoising step, tokens are *sampled* from factorized marginals rather than jointly. The factorization makes discrete diffusion less expressive than AR models when generating tokens in parallel.

Continuous flows trained with Flow Matching ([lipman2023flowmatchinggenerativemodeling,](https://arxiv.org/html/2605.11125#bib.bib47); [liu2022flowstraightfastlearning,](https://arxiv.org/html/2605.11125#bib.bib49); [albergo2023buildingnormalizingflowsstochastic,](https://arxiv.org/html/2605.11125#bib.bib1)) learn a velocity field that defines an *Ordinary Differential Equation* (ODE), transporting noisy samples to the data distribution. Thus, inference steps update all positions jointly and avoid the factorized sampling issue of discrete diffusion.

Recent work ([roos2026categoricalflowmaps,](https://arxiv.org/html/2605.11125#bib.bib75); [lee2026onesteplanguagemodelingcontinuous,](https://arxiv.org/html/2605.11125#bib.bib45); [potaptchik2026discreteflowmaps,](https://arxiv.org/html/2605.11125#bib.bib68)) revived interest in flow-based language models ([li2022diffusionlmimprovescontrollabletext,](https://arxiv.org/html/2605.11125#bib.bib46); [dieleman2022continuousdiffusioncategoricaldata,](https://arxiv.org/html/2605.11125#bib.bib22); [gulrajani2023likelihoodbaseddiffusionlanguagemodels,](https://arxiv.org/html/2605.11125#bib.bib33)), known as *Flow Language Models* (FLMs). Recent FLMs represent tokens as one-hot vectors, add Gaussian noise, and train a denoiser with Cross-Entropy (CE). Although they match the Gen. PPL of AR and discrete diffusion models, these FLMs have two main shortcomings. (1) Firstly, representing tokens as one-hot vectors is costly. *Large Language Models* (LLMs) commonly use vocabularies containing 100k–200k tokens ([openai2024gptoss,](https://arxiv.org/html/2605.11125#bib.bib64); [qwen2025qwen25technicalreport,](https://arxiv.org/html/2605.11125#bib.bib70)), thus large FLMs would need to store a $>100$k-dimensional vector for every token. After adding Gaussian noise to these one-hot vectors, the denoiser multiplies them with the embedding matrix instead of looking up a single vector. Therefore, FLMs are slower to train than discrete diffusion and AR models. (2) Secondly, in images, Gaussian diffusion smoothly degrades higher-frequency components first. For one-hot vectors, the interpretation of adding Gaussian noise is not as clear.

#### Contributions

We propose the *Hyperspherical Flow Language Model* ($\mathbb{S}$-FLM), a flow over embeddings that does not need to materialize one-hot vectors. (1) Recall that the cosine distance captures the similarity between token embeddings better than the Euclidean distance ([mikolov2013efficientestimationwordrepresentations,](https://arxiv.org/html/2605.11125#bib.bib58); [pennington2014glove,](https://arxiv.org/html/2605.11125#bib.bib67); [wang2020understandingcontrastiverepresentationlearning,](https://arxiv.org/html/2605.11125#bib.bib94)). Since the cosine distance is determined by the arc length on the unit hypersphere, we implement $\mathbb{S}$-FLM as a Riemannian flow on $\mathbb{S}^{d-1}$. Our forward process transports unit-norm embeddings toward a uniform prior. (2) $\mathbb{S}$-FLM operates on $d$-dimensional embeddings rather than $|\mathcal{V}|$-dimensional one-hot vectors. Thus, assuming the same backbone, $\mathbb{S}$-FLM has a similar training cost as discrete diffusion models (unlike FLMs over one-hot vectors, which are costlier). We further introduce the $\mathbb{S}$-arch, a backbone whose activations lie on $\mathbb{S}^{d-1}$. Aligning the activations with the input improves the sample quality on GSM8K and OpenWebText (OWT) ([Gokaslan2019OpenWeb,](https://arxiv.org/html/2605.11125#bib.bib30)). (3) On GSM8K, the accuracy of prior FLMs trained on TinyGSM ([liu2023tinygsm,](https://arxiv.org/html/2605.11125#bib.bib48)) is less than $1\%$. In contrast, $\mathbb{S}$-FLM reaches $\sim 12\%$ with the exact velocity and $\sim 18\%$ with the top-$1$ velocity (Sec. [3.1](https://arxiv.org/html/2605.11125#S3.SS1.SSS0.Px3)). At standard-temperature sampling ($T=1$), $\mathbb{S}$-FLM closes the gap to MDLM and Duo with the top-$1$ velocity across *Number of Function Evaluations* (NFE) budgets, and outperforms them with the exact velocity at $\mathrm{NFE}\leq 16$ (Figure [1](https://arxiv.org/html/2605.11125#S0.F1)). A gap remains at low-temperature ($T=0.1$) decoding, where MDLM and Duo reach $33$–$36\%$ (Figure [3](https://arxiv.org/html/2605.11125#S4.F3)).

## 2 Background

#### Notation

We denote by $\mathcal{V}$ the finite vocabulary of size $|\mathcal{V}|$. Boldface letters denote either sequences $\mathbf{x}\in\mathcal{V}^{L}$ of $L$ tokens, with $\mathbf{x}^{\ell}$ the $\ell$-th element, or vectors in $\mathbb{R}^{d}$; the meaning is clear from context. We write $\mathbb{S}^{d-1}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|=1\}$ for the unit hypersphere in $\mathbb{R}^{d}$, and $\mathcal{U}(\mathbb{S}^{d-1})$ for the uniform distribution on $\mathbb{S}^{d-1}$. Token embeddings are stored in a lookup table $\mathbf{E}\in\mathbb{R}^{|\mathcal{V}|\times d}$ and we write $\mathbf{e}_{v}\in\mathbb{R}^{d}$ for the row associated with $v\in\mathcal{V}$.

#### Language Modeling and Autoregressive Models

Language models approximate the data distribution $p_{\text{data}}:\mathcal{V}^{L}\rightarrow[0,1]$ over sequences with a density $p_{\theta}(\mathbf{x})$. AR models factorize $p_{\theta}(\mathbf{x})=\prod_{\ell=1}^{L}p_{\theta}(\mathbf{x}^{\ell}\mid\mathbf{x}^{<\ell})$ using the chain rule of probability. This factorization enables exact likelihood training, but it makes inference slow, since tokens are generated one by one, and it prevents conditioning on future tokens.

### 2.1 Flow Generative Modeling

Continuous Normalizing Flows (CNFs) ([Chen2018NeuralOD,](https://arxiv.org/html/2605.11125#bib.bib12); [grathwohl2018ffjordfreeformcontinuousdynamics,](https://arxiv.org/html/2605.11125#bib.bib32)) are generative models on $\mathbb{R}^{d}$. CNFs learn a continuous-time transport from a noise distribution $p_{0}=p_{\text{noise}}$ to the data distribution $p_{1}=p_{\text{data}}$. The time-dependent velocity field $u^{\theta}_{t}\colon\mathbb{R}^{d}\to\mathbb{R}^{d}$ induces the *flow* $\phi_{t}\colon\mathbb{R}^{d}\to\mathbb{R}^{d}$ by integration:

$$\frac{d}{dt}\phi_{t}(z_{0})=u^{\theta}_{t}(\phi_{t}(z_{0})),\quad\phi_{0}(z_{0})=z_{0}. \tag{1}$$

The velocity field $u^{\theta}_{t}$ is parameterized by a neural network with continuously differentiable, bounded derivatives, so the ODE ([1](https://arxiv.org/html/2605.11125#S2.E1)) has a unique solution ([coddington1955theory,](https://arxiv.org/html/2605.11125#bib.bib15)). For $t\in[0,1]$, the pushforward $p_{t}=[\phi_{t}]_{\sharp}\,p_{0}$ defines intermediate densities $p_{t}$, and the family $\{p_{t}\}_{t\in[0,1]}$ is called a *probability path* from $p_{0}$ to $p_{1}$. Integrating ([1](https://arxiv.org/html/2605.11125#S2.E1)) up to $t$ produces a sample $z_{t}=\phi_{t}(z_{0})\sim p_{t}$.
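
As a rough illustration of how the ODE (1) is integrated in practice, the sketch below applies explicit Euler steps to a hand-written velocity field. The names `euler_integrate` and `toy_velocity` are hypothetical, and the toy field (which pulls every point toward a fixed target `x`) stands in for a learned $u^{\theta}_{t}$; it is not the paper's sampler.

```python
import numpy as np

def euler_integrate(velocity, z0, num_steps):
    """Explicit Euler integration of dz/dt = velocity(z, t) from t=0 to t=1."""
    z = np.array(z0, dtype=float)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h                      # t never reaches 1, so 1 - t stays positive
        z = z + h * velocity(z, t)
    return z

# Toy target and velocity field (assumptions for illustration only).
x = np.array([1.0, -2.0, 0.5])

def toy_velocity(z, t):
    """Velocity of the straight-line path toward x with alpha_t = t."""
    return (x - z) / (1.0 - t)

z0 = np.zeros(3)                       # stand-in for a noise sample
z1 = euler_integrate(toy_velocity, z0, num_steps=8)
```

With this particular field, the last Euler step lands on `x`, so `z1` matches the target up to rounding. A learned velocity field would generally need more steps or a higher-order solver.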

#### Flow Matching

Flow Matching (FM) ([lipman2023flowmatchinggenerativemodeling,](https://arxiv.org/html/2605.11125#bib.bib47); [liu2022flowstraightfastlearning,](https://arxiv.org/html/2605.11125#bib.bib49); [albergo2023buildingnormalizingflowsstochastic,](https://arxiv.org/html/2605.11125#bib.bib1)) is a method to learn the velocity field $u^{\theta}_{t}$ that transports $p_{0}$ to $p_{1}$ (Suppl. [A](https://arxiv.org/html/2605.11125#A1)). The *true* velocity $u_{t}$ is typically expressed as an expectation:

$$u_{t}(z_{t})=\int u_{t|1}(z_{t}\mid x)\,p_{1|t}(x\mid z_{t})\,dx, \tag{2}$$

where $u_{t|1}$ is a *conditional* velocity, conditioned on $x\sim p_{\text{data}}$, and $p_{1|t}$ is the posterior given the noisy example $z_{t}$. Since $p_{1|t}$ is generally intractable, FM trains $u^{\theta}_{t}$ against $u_{t|1}$ instead. Let $\psi_{t|1}$ denote the *conditional* flow associated with $u_{t|1}$. A common choice is the linear interpolation:

$$\psi_{t|1}(z_{0}\mid x)=z_{t}=\alpha_{t}\,x+(1-\alpha_{t})\,z_{0}, \tag{3}$$

where $z_{0}\sim p_{0}$ and $x\sim p_{1}$. Here, $\alpha_{t}\colon[0,1]\to[0,1]$ is a monotonically increasing *noise schedule* with $\alpha_{0}=0$ and $\alpha_{1}=1$. The conditional velocity is given by the time derivative of $\psi_{t|1}$:

$$u_{t|1}(z_{t}\mid x)=\dot{\alpha}_{t}\,(x-z_{0}). \tag{4}$$

A key result in FM ([lipman2023flowmatchinggenerativemodeling,](https://arxiv.org/html/2605.11125#bib.bib47)) is that the minimizer of the *Conditional Flow Matching* (CFM) loss is the marginal velocity $u_{t}$ ([2](https://arxiv.org/html/2605.11125#S2.E2)):

$$\mathcal{L}_{\text{CFM}}(\theta)=\mathbb{E}_{t\sim\mathcal{U}[0,1],\,z_{0}\sim p_{0},\,x\sim p_{1}}\left\|u^{\theta}_{t}(z_{t})-u_{t|1}(z_{t}\mid x)\right\|^{2}. \tag{5}$$
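
To make Eqs. (3)–(5) concrete, here is a minimal Monte Carlo sketch of the CFM loss, assuming the schedule $\alpha_{t}=t$ (so $\dot{\alpha}_{t}=1$). The function names and the toy setup, in which all data sits at a single point `c` so that the marginal velocity is known in closed form, are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def cfm_loss(velocity_model, x, z0, t):
    """Monte Carlo estimate of Eq. (5), assuming alpha_t = t."""
    alpha = t[:, None]
    z_t = alpha * x + (1.0 - alpha) * z0     # Eq. (3): linear interpolation
    target = x - z0                          # Eq. (4) with alpha_dot = 1
    pred = velocity_model(z_t, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
c = np.array([2.0, -1.0])                    # toy data distribution: a point mass at c
x = np.tile(c, (256, 1))
z0 = rng.standard_normal((256, 2))           # Gaussian noise samples
t = rng.uniform(0.0, 0.9, size=256)          # stay away from t = 1 for numerical safety

def oracle(z_t, t):
    """Marginal velocity for point-mass data: (c - z_t) / (1 - t)."""
    return (c - z_t) / (1.0 - t[:, None])

def zero_model(z_t, t):
    """A deliberately bad model that predicts zero velocity."""
    return np.zeros_like(z_t)

loss_oracle = cfm_loss(oracle, x, z0, t)
loss_zero = cfm_loss(zero_model, x, z0, t)
```

In this degenerate case the posterior $p_{1|t}$ is trivial, so the conditional and marginal velocities coincide and the oracle attains (numerically) zero loss. For real data the minimum of Eq. (5) is generally positive.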

### 2.2 Geometry of the Hypersphere

We summarize the primitives we use on $\mathbb{S}^{d-1}$. For a thorough treatment of Riemannian geometry, see do Carmo ([docarmo1992riemannian,](https://arxiv.org/html/2605.11125#bib.bib23)). The *geodesic distance* between $\mathbf{p},\mathbf{q}\in\mathbb{S}^{d-1}$ is the angle $d_{\mathbb{S}}(\mathbf{p},\mathbf{q})=\arccos(\mathbf{p}^{\top}\mathbf{q})\in[0,\pi]$. The *tangent space* at $\mathbf{p}$ is $T_{\mathbf{p}}\mathbb{S}^{d-1}=\{\mathbf{v}\in\mathbb{R}^{d}:\mathbf{v}^{\top}\mathbf{p}=0\}$. The *exponential map* $\exp_{\mathbf{p}}\colon T_{\mathbf{p}}\mathbb{S}^{d-1}\to\mathbb{S}^{d-1}$ moves $\mathbf{p}$ along the geodesic in direction $\mathbf{v}$ for arc length $\|\mathbf{v}\|$:

$$\exp_{\mathbf{p}}(\mathbf{v})=\cos(\|\mathbf{v}\|)\,\mathbf{p}+\sin(\|\mathbf{v}\|)\,\frac{\mathbf{v}}{\|\mathbf{v}\|}. \tag{6}$$

The *logarithmic map* $\log_{\mathbf{p}}\colon\mathbb{S}^{d-1}\to T_{\mathbf{p}}\mathbb{S}^{d-1}$ inverts the exponential map and returns the tangent vector at $\mathbf{p}$ pointing toward $\mathbf{q}$, with magnitude $d_{\mathbb{S}}(\mathbf{p},\mathbf{q})$ (when $\mathbf{p}=-\mathbf{q}$, i.e. for antipodal points, $\log_{\mathbf{p}}(\mathbf{q})$ is undefined since infinitely many geodesics connect them):

$$\log_{\mathbf{p}}(\mathbf{q})=\frac{\omega}{\sin\omega}\left(\mathbf{q}-\cos(\omega)\,\mathbf{p}\right),\quad\omega=d_{\mathbb{S}}(\mathbf{p},\mathbf{q}). \tag{7}$$

The *Spherical Linear Interpolation* (SLERP) follows the geodesic from $\mathbf{p}$ to $\mathbf{q}$ ([SLERP,](https://arxiv.org/html/2605.11125#bib.bib83)):

$$\mathrm{SLERP}(\mathbf{p},\mathbf{q},t)=\frac{\sin((1-t)\,\omega)}{\sin\omega}\,\mathbf{p}+\frac{\sin(t\,\omega)}{\sin\omega}\,\mathbf{q},\quad\omega=d_{\mathbb{S}}(\mathbf{p},\mathbf{q}), \tag{8}$$

or equivalently, $\mathrm{SLERP}(\mathbf{p},\mathbf{q},t)=\exp_{\mathbf{p}}\!\left(t\,\log_{\mathbf{p}}(\mathbf{q})\right)$. To sample uniformly on $\mathbb{S}^{d-1}$, draw $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$ and normalize $\bm{\epsilon}\leftarrow\bm{\epsilon}/\|\bm{\epsilon}\|$ ([muller1959note,](https://arxiv.org/html/2605.11125#bib.bib59)).
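
The primitives in Eqs. (6)–(8) and the uniform sampler translate almost directly into NumPy. The sketch below is one possible single-vector implementation; the function names and the small-angle guards (thresholds of `1e-12`) are choices made here for illustration, and the antipodal case is deliberately left unhandled, as in the text.

```python
import numpy as np

def sample_uniform_sphere(rng, d):
    """Draw one point uniformly on S^{d-1} by normalizing a Gaussian vector."""
    eps = rng.standard_normal(d)
    return eps / np.linalg.norm(eps)

def geodesic_distance(p, q):
    """Angle between unit vectors p and q, in [0, pi]."""
    return np.arccos(np.clip(p @ q, -1.0, 1.0))

def exp_map(p, v):
    """Eq. (6): move from p along the tangent vector v for arc length ||v||."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:                   # zero tangent vector: stay at p
        return p.copy()
    return np.cos(norm) * p + np.sin(norm) * v / norm

def log_map(p, q):
    """Eq. (7): tangent vector at p pointing toward q (undefined for q = -p)."""
    omega = geodesic_distance(p, q)
    if omega < 1e-12:                  # q coincides with p
        return np.zeros_like(p)
    return omega / np.sin(omega) * (q - np.cos(omega) * p)

def slerp(p, q, t):
    """Eq. (8): point at fraction t along the geodesic from p to q."""
    omega = geodesic_distance(p, q)
    if omega < 1e-12:
        return p.copy()
    return (np.sin((1 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

rng = np.random.default_rng(0)
p = sample_uniform_sphere(rng, 16)
q = sample_uniform_sphere(rng, 16)
mid = slerp(p, q, 0.3)                 # 30% of the way from p to q
```

The interpolated point stays on the sphere, and `exp_map(p, 0.3 * log_map(p, q))` gives the same result, which matches the equivalence stated after Eq. (8).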

### 2.3 Flow Matching on the Hypersphere

Riemannian Flow Matching (RFM) ([chen2024flowmatchinggeneralgeometries,](https://arxiv.org/html/2605.11125#bib.bib11)) extends FM to Riemannian manifolds, which include $\mathbb{S}^{d-1}$. As in the Euclidean case, the marginal velocity is the expectation of a conditional velocity:

$$u_{t}(\mathbf{z}_{t})=\int u_{t|1}(\mathbf{z}_{t}\mid\mathbf{z}_{1})\,p_{1|t}(\mathbf{z}_{1}\mid\mathbf{z}_{t})\,d\mathbf{z}_{1}, \tag{9}$$

where $u_{t|1}$ is the conditional velocity associated with the conditional flow $\psi_{t|1}$, and $p_{1|t}$ is the posterior given $\mathbf{z}_{t}$. We define $\psi_{t|1}$ with SLERP ([8](https://arxiv.org/html/2605.11125#S2.E8)), where $\mathbf{z}_{0}\sim p_{0}$ is drawn from a noise distribution on $\mathbb{S}^{d-1}$ and $\mathbf{z}_{1}\sim p_{1}$ is a data sample:

$$\psi_{t|1}(\mathbf{z}_{0}\mid\mathbf{z}_{1})=\mathbf{z}_{t}=\mathrm{SLERP}(\mathbf{z}_{0},\mathbf{z}_{1},\alpha_{t})=\exp_{\mathbf{z}_{0}}\!\left(\alpha_{t}\,\log_{\mathbf{z}_{0}}(\mathbf{z}_{1})\right), \tag{10}$$

which satisfies $\psi_{0|1}(\mathbf{z}_{0}\mid\mathbf{z}_{1})=\mathbf{z}_{0}$ and $\psi_{1|1}(\mathbf{z}_{0}\mid\mathbf{z}_{1})=\mathbf{z}_{1}$. Differentiating $\psi_{t|1}$ gives the conditional velocity (Suppl. [A.2](https://arxiv.org/html/2605.11125#A1.SS2)):

$$u_{t|1}(\mathbf{z}_{t}\mid\mathbf{z}_{1})=\frac{\dot{\alpha}_{t}}{1-\alpha_{t}}\,\log_{\mathbf{z}_{t}}(\mathbf{z}_{1}). \tag{11}$$

As in the Euclidean case, the minimizer of the Riemannian Conditional Flow Matching (RCFM) loss is the marginal velocity $u_{t}$ ([9](https://arxiv.org/html/2605.11125#S2.E9)):

$$\mathcal{L}_{\text{RCFM}}(\theta)=\mathbb{E}_{t\sim\mathcal{U}[0,1],\,\mathbf{z}_{0}\sim p_{0},\,\mathbf{z}_{1}\sim p_{1}}\left\|u^{\theta}_{t}(\mathbf{z}_{t})-u_{t|1}(\mathbf{z}_{t}\mid\mathbf{z}_{1})\right\|^{2}. \tag{12}$$
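
One way to sanity-check the conditional velocity in Eq. (11) is to compare it with a finite-difference derivative of the SLERP path in Eq. (10). The sketch below does this under the assumption $\alpha_{t}=t$ (so $\dot{\alpha}_{t}=1$); the helper names are hypothetical and the check is numerical, not a substitute for the derivation in the supplement.

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

def slerp(p, q, t):
    """Eq. (8) for two unit vectors that are neither equal nor antipodal."""
    omega = np.arccos(np.clip(p @ q, -1.0, 1.0))
    return (np.sin((1 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

def log_map(p, q):
    """Eq. (7): tangent vector at p pointing toward q."""
    omega = np.arccos(np.clip(p @ q, -1.0, 1.0))
    return omega / np.sin(omega) * (q - np.cos(omega) * p)

def conditional_velocity(z_t, z_1, alpha, alpha_dot):
    """Eq. (11): velocity of the SLERP path at z_t toward z_1."""
    return alpha_dot / (1.0 - alpha) * log_map(z_t, z_1)

rng = np.random.default_rng(0)
z0 = unit(rng.standard_normal(8))      # noise endpoint
z1 = unit(rng.standard_normal(8))      # data endpoint
t, h = 0.4, 1e-5

z_t = slerp(z0, z1, t)
u_closed = conditional_velocity(z_t, z1, alpha=t, alpha_dot=1.0)
# Central finite difference of the path t -> SLERP(z0, z1, t).
u_numeric = (slerp(z0, z1, t + h) - slerp(z0, z1, t - h)) / (2 * h)
```

The two estimates agree up to finite-difference error, and `u_closed` is orthogonal to `z_t`, as expected for a vector in the tangent space $T_{\mathbf{z}_{t}}\mathbb{S}^{d-1}$.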

## 3 Hyperspherical Flow Language Models

Figure [2](https://arxiv.org/html/2605.11125#S0.F2) gives an overview of training and sampling. $\mathbb{S}$-FLM is a Riemannian CNF on $(\mathbb{S}^{d-1})^{L}$ that transports a sequence of random vectors on $\mathbb{S}^{d-1}$ towards the clean token representation. To bridge the continuous and discrete representations, we map tokens to the sphere via a normalized embedding lookup and decode via the $\arg\max$ of $p^{\theta}_{1|t}$. The denoiser $p^{\theta}_{1|t}$, trained with cross-entropy, induces a closed-form marginal velocity field that we integrate at sampling time. Optionally, after each optimization step, we re-project the embeddings to $\mathbb{S}^{d-1}$ (Suppl. [B.5](https://arxiv.org/html/2605.11125#A2.SS5)).

#### Encoder and Decoder

Let $\mathbf{x}=(x^{1},\ldots,x^{L})\in\mathcal{V}^{L}$ be an input sequence of $L$ tokens. Each token $v\in\mathcal{V}$ is associated with a unit-norm embedding $\hat{\mathbf{e}}_{v}\in\mathbb{S}^{d-1}$, hence we represent $\mathbf{x}$ as a sequence in $(\mathbb{S}^{d-1})^{L}$. The decoder $g\colon(\mathbb{S}^{d-1})^{L}\times[0,1]\to\mathcal{V}^{L}$ is defined as the $\arg\max$ of the learned posterior:

$$\mathbf{e}_{v}=\mathbf{E}[v],\qquad\hat{\mathbf{e}}_{v}=\frac{\mathbf{e}_{v}}{\|\mathbf{e}_{v}\|_{2}},\qquad g^{\ell}(\mathbf{z},t)=\arg\max_{v\in\mathcal{V}}\,p^{\theta}_{1|t}(\mathbf{x}^{\ell}=v\mid\mathbf{z}). \tag{13}$$

#### Training with Cross\-Entropy

Unlike standard flow matching on fixed data (e.g., pixels), we train the data representation via an embedding table $\mathbf{E}$ jointly with the flow by backpropagating through the SLERP. Regressing the velocity field against learnable embeddings admits a trivial minimum where all token representations collapse to a point. Instead, we approximate the posterior $p_{1|t}(\cdot\mid\mathbf{z}_{t})$ with a denoiser $p^{\theta}_{1|t}(\cdot\mid\mathbf{z}_{t})$, trained with cross-entropy (CE). The denoiser cannot recover the clean token if embeddings collapse, thus the CE pushes them apart, as noted in CDCD ([dieleman2022continuousdiffusioncategoricaldata,](https://arxiv.org/html/2605.11125#bib.bib22)). At every position $\ell$, we take $\mathbf{z}_{1}^{\ell}=\hat{\mathbf{e}}_{\mathbf{x}^{\ell}}$ ([13](https://arxiv.org/html/2605.11125#S3.E13)) as the clean endpoint, draw $\mathbf{z}_{0}^{\ell}\sim\mathcal{U}(\mathbb{S}^{d-1})$ independently, and form the noisy latent $\mathbf{z}_{t}^{\ell}=\mathrm{SLERP}(\mathbf{z}_{0}^{\ell},\mathbf{z}_{1}^{\ell},\alpha_{t})$ ([10](https://arxiv.org/html/2605.11125#S2.E10)). We minimize the cross-entropy

$$\mathcal{L}_{\text{CE}}(\theta)=\mathbb{E}_{\mathbf{x}\sim p_{1},\,t\sim\mathcal{U}[0,1],\,\mathbf{z}_{0}\sim p_{0}}\left[-\sum_{\ell=1}^{L}\log p^{\theta}_{1|t}(\mathbf{x}^{\ell}\mid\mathbf{z}_{t})\right]. \tag{14}$$
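
The forward pass of this objective can be sketched in a few lines of NumPy. This is only an illustration of the data flow (normalized lookup, uniform noise, SLERP, cross-entropy): the "denoiser" below is a hypothetical stand-in that scores each token by scaled cosine similarity, not the paper's Transformer or $\mathbb{S}$-arch, and since NumPy has no automatic differentiation, no gradients are computed.

```python
import numpy as np

def normalize(v):
    """Project vectors onto the unit sphere along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def slerp_batch(z0, z1, alpha):
    """Eq. (8) applied row-wise to (L, d) arrays of unit vectors."""
    cos = np.clip(np.sum(z0 * z1, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos)
    return (np.sin((1 - alpha) * omega) * z0 + np.sin(alpha * omega) * z1) / np.sin(omega)

def cross_entropy(logits, targets):
    """Negative log-likelihood of the targets, summed over positions as in Eq. (14)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

def ce_loss(E, tokens, alpha, rng, scale=10.0):
    """One Monte Carlo sample of the loss for a single sequence and noise level."""
    E_hat = normalize(E)                       # unit-norm embeddings, Eq. (13)
    z1 = E_hat[tokens]                         # clean endpoints
    z0 = normalize(rng.standard_normal(z1.shape))  # uniform noise on the sphere
    z_t = slerp_batch(z0, z1, alpha)           # noisy latents, Eq. (10)
    logits = scale * z_t @ E_hat.T             # hypothetical stand-in denoiser
    return cross_entropy(logits, tokens), z_t

rng = np.random.default_rng(0)
E = rng.standard_normal((50, 32))              # toy table: |V| = 50, d = 32
tokens = rng.integers(0, 50, size=12)          # toy sequence of L = 12 tokens
loss_noisy, z_noisy = ce_loss(E, tokens, alpha=0.1, rng=rng)
loss_clean, z_clean = ce_loss(E, tokens, alpha=1.0, rng=rng)
```

At $\alpha_{t}=1$ the latent coincides with the clean embedding and the toy loss is small, whereas near $\alpha_{t}=0$ the latent is almost pure noise and the loss is much larger.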

### 3.1 Sampling

#### Exact velocity

Because our data distribution is supported on $\{\hat{\mathbf{e}}_{v}:v\in\mathcal{V}\}^{L}\subset(\mathbb{S}^{d-1})^{L}$, the marginal velocity ([9](https://arxiv.org/html/2605.11125#S2.E9)) reduces to a finite sum at each position:

$$u^{\theta}_{t}(\mathbf{z}_{t}^{\ell})=\frac{\dot{\alpha}_{t}}{1-\alpha_{t}}\sum_{v\in\mathcal{V}}p^{\theta}_{1|t}(\mathbf{x}^{\ell}=v\mid\mathbf{z}_{t})\,\log_{\mathbf{z}_{t}^{\ell}}\!\left(\hat{\mathbf{e}}_{v}\right). \tag{15}$$

We call this the *exact velocity*, in contrast to the stochastic and top-$k$ approximations below.
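
For a single position, Eq. (15) is a posterior-weighted average of log maps toward every token embedding. The following sketch computes it with NumPy, using a random Dirichlet vector as a stand-in for the denoiser output. The names `log_map_to_all` and `exact_velocity` are hypothetical, and antipodal embeddings (where the log map is undefined) are not treated specially.

```python
import numpy as np

def normalize(v):
    """Project vectors onto the unit sphere along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def log_map_to_all(z, E_hat):
    """Eq. (7) from a single point z (d,) to every row of E_hat (V, d)."""
    cos = np.clip(E_hat @ z, -1.0, 1.0)
    omega = np.arccos(cos)
    sin = np.sin(omega)
    safe = sin > 1e-12
    # omega / sin(omega) tends to 1 as omega -> 0; antipodal rows are not handled.
    scale = np.where(safe, omega / np.where(safe, sin, 1.0), 1.0)
    return scale[:, None] * (E_hat - cos[:, None] * z[None, :])

def exact_velocity(z, probs, E_hat, alpha, alpha_dot):
    """Eq. (15): posterior-weighted sum of tangent vectors toward each embedding."""
    return alpha_dot / (1.0 - alpha) * (probs @ log_map_to_all(z, E_hat))

rng = np.random.default_rng(0)
V, d = 20, 8                                   # toy vocabulary size and dimension
E_hat = normalize(rng.standard_normal((V, d)))
z = normalize(rng.standard_normal(d))          # current latent at one position
probs = rng.dirichlet(np.ones(V))              # stand-in for the denoiser posterior
u = exact_velocity(z, probs, E_hat, alpha=0.5, alpha_dot=1.0)
```

Because every summand lies in the tangent space at `z`, so does `u`. When the posterior is one-hot, the expression collapses to the conditional velocity of Eq. (11) toward that single token.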

#### Stochastic decoding

We can replace the sum in ([15](https://arxiv.org/html/2605.11125#S3.E15)) with a single Monte Carlo sample of the posterior. At each step, we draw $\hat{\mathbf{x}}^{\ell}\sim p^{\theta}_{1|t}(\cdot\mid\mathbf{z}_{t})$ and use $\bar{\mathbf{v}}^{\ell}=\log_{\mathbf{z}_{t}^{\ell}}(\hat{\mathbf{e}}_{\hat{\mathbf{x}}^{\ell}})$ in place of ([15](https://arxiv.org/html/2605.11125#S3.E15)). The deterministic and stochastic samplers differ only in how they construct the velocity from $p^{\theta}_{1|t}$. CANDI ([pynadath2025candihybriddiscretecontinuousdiffusion,](https://arxiv.org/html/2605.11125#bib.bib69)) uses an analogous one-sample Monte Carlo approximation. See Algo. [3](https://arxiv.org/html/2605.11125#alg3) for the pseudocode.
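
The one-sample construction can be sketched as follows for a single position. The code returns only the direction $\bar{\mathbf{v}}^{\ell}$ described above; how it is scaled and stepped inside the sampler follows Algo. 3, which is not reproduced here, so the function name and the toy posterior are assumptions made for illustration.

```python
import numpy as np

def normalize(v):
    """Project vectors onto the unit sphere along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def log_map(p, q):
    """Eq. (7): tangent vector at p pointing toward q (not antipodal)."""
    omega = np.arccos(np.clip(p @ q, -1.0, 1.0))
    if omega < 1e-12:
        return np.zeros_like(p)
    return omega / np.sin(omega) * (q - np.cos(omega) * p)

def stochastic_direction(z, probs, E_hat, rng):
    """Sample one token from the posterior and return the log map toward it."""
    token = int(rng.choice(len(probs), p=probs))
    return token, log_map(z, E_hat[token])

rng = np.random.default_rng(0)
V, d = 20, 8                                   # toy vocabulary size and dimension
E_hat = normalize(rng.standard_normal((V, d)))
z = normalize(rng.standard_normal(d))          # current latent at one position
probs = rng.dirichlet(np.ones(V))              # stand-in for the denoiser posterior
token, v_bar = stochastic_direction(z, probs, E_hat, rng)
```

The returned vector is tangent to the sphere at `z`, and its norm equals the geodesic distance from `z` to the sampled embedding. Averaging it over posterior samples recovers the sum in Eq. (15) without the schedule factor.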

#### Top-$k$ velocity

Alternatively, we can restrict the sum in ([15](https://arxiv.org/html/2605.11125#S3.E15)) to the top-$k$ entries of $p^{\theta}_{1|t}(\cdot\mid\mathbf{z}_{t})$ at each sampling step. We take the top-$k$ logits and apply log-softmax to renormalize over the truncated set. We call this *top-$k$ velocity decoding*, with $k=1$ giving *top-$1$* decoding, the analogue of greedy decoding for autoregressive models.
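
A minimal single-position sketch of the top-$k$ velocity is given below. It selects the $k$ largest logits, renormalizes them with a softmax (the exponential of the log-softmax mentioned above), and applies Eq. (15) on the truncated set. The function names and the random logits are illustrative assumptions.

```python
import numpy as np

def normalize(v):
    """Project vectors onto the unit sphere along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def log_map(p, q):
    """Eq. (7): tangent vector at p pointing toward q (not antipodal)."""
    omega = np.arccos(np.clip(p @ q, -1.0, 1.0))
    if omega < 1e-12:
        return np.zeros_like(p)
    return omega / np.sin(omega) * (q - np.cos(omega) * p)

def topk_velocity(z, logits, E_hat, k, alpha, alpha_dot):
    """Eq. (15) restricted to the k most likely tokens, renormalized."""
    idx = np.argsort(logits)[-k:]               # indices of the k largest logits
    top = logits[idx] - logits[idx].max()       # shift for numerical stability
    probs = np.exp(top) / np.exp(top).sum()     # softmax over the truncated set
    tangents = np.stack([log_map(z, E_hat[i]) for i in idx])
    return alpha_dot / (1.0 - alpha) * (probs @ tangents)

rng = np.random.default_rng(0)
V, d = 20, 8                                    # toy vocabulary size and dimension
E_hat = normalize(rng.standard_normal((V, d)))
z = normalize(rng.standard_normal(d))           # current latent at one position
logits = rng.standard_normal(V)                 # stand-in for the denoiser logits
u_top1 = topk_velocity(z, logits, E_hat, k=1, alpha=0.5, alpha_dot=1.0)
u_full = topk_velocity(z, logits, E_hat, k=V, alpha=0.5, alpha_dot=1.0)
```

With $k=1$ the velocity points along the geodesic toward the single most likely token, and with $k=|\mathcal{V}|$ it coincides with the exact velocity.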

### 3.2 Noise Schedule

#### Truncation

After training with standard noise schedules (Table [5](https://arxiv.org/html/2605.11125#A3.T5)), we observe that $p^{\theta}_{1|t}$ becomes close to one-hot within a few sampling steps. This is likely due to the *curse of dimensionality* ([bellman1961adaptive,](https://arxiv.org/html/2605.11125#bib.bib6)). In high dimension, $\mathbb{S}^{d-1}$ has enough room to spread $|\mathcal{V}|$ embeddings apart ([vershynin2018high,](https://arxiv.org/html/2605.11125#bib.bib91)). Training with CE separates the embeddings because the denoiser cannot differentiate tokens whose embeddings coincide ([wang2020understandingcontrastiverepresentationlearning,](https://arxiv.org/html/2605.11125#bib.bib94)). When the embeddings are well separated, the true posterior $p_{1|t}$ at low noise levels ($\alpha_{t}$ large) is close to one-hot. Thus, during sampling, after $\mathbf{z}_{t}^{\ell}$ enters the Voronoi cell of a clean embedding $\hat{\mathbf{e}}_{k}$, the posterior collapses. To avoid training on noise levels where $p_{1|t}$ is close to one-hot, we truncate the noise schedule to $[0,a]\subseteq[0,1]$, i.e. we train only at the higher noise levels. We propose a closed-form expression for the truncation bound $a$ as a function of the vocabulary size $|\mathcal{V}|$ and embedding dimension $d$ by analyzing a tractable approximation of the sampling dynamics on $\mathbb{S}^{d-1}$. Truncation is critical for strong performance, as seen in Sudoku (Table [1](https://arxiv.org/html/2605.11125#S4.T1)). In a grid search on GSM8K, our bound achieves accuracy similar to that of the best truncation value (Suppl. [C.6](https://arxiv.org/html/2605.11125#A3.SS6)).

Tractable model of the sampling dynamics. Let $\{\hat{\mathbf{e}}_{v}\}_{v\in\mathcal{V}}$ be the normalized token representations, sampled i.i.d. uniformly on $\mathbb{S}^{d-1}$. Let $\mathbf{z}_{0}\sim\mathcal{U}(\mathbb{S}^{d-1})$ be the initial noise sample. Fix a target token $k$ and let $\mathbf{z}_{\alpha}=\mathrm{SLERP}(\mathbf{z}_{0},\hat{\mathbf{e}}_{k},\alpha)$ for $\alpha\in[0,1]$. Define $\alpha^{\star}(\delta)$ as the smallest $\alpha$ at which $\hat{\mathbf{e}}_{k}$ is the nearest neighbor of $\mathbf{z}_{\alpha}$ with probability at least $1-\delta$. Then, in high dimension,

$$\alpha^{\star}(\delta)\approx\frac{2}{\pi}\arcsin\!\left(\sqrt{\frac{2\log\!\bigl(2(|\mathcal{V}|-1)/\delta\bigr)}{d}}\right). \tag{16}$$
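For concreteness, (16) can be evaluated directly. The sketch below is only an illustration: the function name `truncation_bound`, the default `delta=0.05`, and the clipping of the arcsine argument to 1 (a guard for regimes where the high-dimensional approximation no longer applies) are assumptions, not choices taken from the paper.

```python
import math

def truncation_bound(vocab_size: int, dim: int, delta: float = 0.05) -> float:
    """Evaluate Eq. (16): (2/pi) * arcsin(sqrt(2 * log(2(|V|-1)/delta) / d))."""
    if vocab_size < 2 or dim < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need vocab_size >= 2, dim >= 1 and 0 < delta < 1")
    # High-probability bound on the largest non-target cosine similarity.
    threshold = math.sqrt(2.0 * math.log(2.0 * (vocab_size - 1) / delta) / dim)
    # Assumed guard: a cosine cannot exceed 1, so clip before taking arcsin.
    return (2.0 / math.pi) * math.asin(min(threshold, 1.0))

# Illustrative values only (hypothetical vocabulary size and embedding dimension).
alpha_star = truncation_bound(vocab_size=50_000, dim=768, delta=0.05)
```

The value grows with the vocabulary size and shrinks with the embedding dimension, matching the intuition that a larger sphere leaves more room between embeddings.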

###### Derivation sketch\.

Under our model, the similarity to non-target tokens $\langle\mathbf{z}_{\alpha},\hat{\mathbf{e}}_{v}\rangle$ (with $v\neq k$) is sub-Gaussian, as the inner product of a fixed vector and a uniform random vector on $\mathbb{S}^{d-1}$ ([vershynin2018high,](https://arxiv.org/html/2605.11125#bib.bib91)). With a simple union bound, we conclude that $\max_{v\neq k}\langle\mathbf{z}_{\alpha},\hat{\mathbf{e}}_{v}\rangle\leq\sqrt{2\log(2(|\mathcal{V}|-1)/\delta)/d}$ with probability at least $1-\delta$. Along the sampling trajectory, $\langle\mathbf{z}_{\alpha},\hat{\mathbf{e}}_{k}\rangle=\cos((1-\alpha)\omega)$ with $\omega=d_{\mathbb{S}}(\mathbf{z}_{0},\hat{\mathbf{e}}_{k})\approx\pi/2$ in high dimension, thus $\langle\mathbf{z}_{\alpha},\hat{\mathbf{e}}_{k}\rangle\approx\sin(\pi\alpha/2)$. We conclude by solving for the critical $\alpha$ at which $\mathbf{z}_{\alpha}$ is closest to $\hat{\mathbf{e}}_{k}$ with high probability, i.e. where $\sin(\pi\alpha/2)$ equals the bound on the non-target similarities. The complete argument is in Suppl. [C.2](https://arxiv.org/html/2605.11125#A3.SS2), and the numerical values in Table [4](https://arxiv.org/html/2605.11125#A3.T4). ∎
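The two ingredients of the sketch, the SLERP identity for the target similarity and the union bound on the non-target similarities, can be checked numerically under the same random-embedding model. The dimension, vocabulary size, seed, and the helper names `random_unit_vectors` and `slerp` below are illustrative assumptions.

```python
import numpy as np

def random_unit_vectors(rng, n, d):
    """Draw n points uniformly on the unit sphere in R^d."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def slerp(z0, z1, alpha):
    """Spherical interpolation from z0 (alpha=0) to z1 (alpha=1)."""
    omega = np.arccos(np.clip(np.dot(z0, z1), -1.0, 1.0))
    if omega < 1e-8:  # nearly identical endpoints
        return z1.copy()
    return (np.sin((1 - alpha) * omega) * z0 + np.sin(alpha * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
d, vocab_size, k, alpha, delta = 256, 2000, 0, 0.6, 0.05
emb = random_unit_vectors(rng, vocab_size, d)   # stand-in for learned embeddings
z0 = random_unit_vectors(rng, 1, d)[0]          # initial noise sample
omega = np.arccos(np.clip(z0 @ emb[k], -1.0, 1.0))

z_alpha = slerp(z0, emb[k], alpha)
target_sim = z_alpha @ emb[k]                   # equals cos((1 - alpha) * omega)
max_other = np.delete(emb @ z_alpha, k).max()   # largest non-target similarity
bound = np.sqrt(2 * np.log(2 * (vocab_size - 1) / delta) / d)
```

In this setting `target_sim` stays close to $\sin(\pi\alpha/2)$ because $\omega$ concentrates around $\pi/2$, and `max_other` is expected to fall below `bound` with probability at least $1-\delta$.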

#### Adaptive noise schedule

On top of truncating to $[0,a]$, we adapt the noise schedule during training to allocate more samples to noise levels where the loss $\mathcal{L}$ changes most, inspired by InfoNoise ([raya2026informationguidednoiseallocation,](https://arxiv.org/html/2605.11125#bib.bib73)). Every 50 steps, we fit the loss profile $\hat{\mathcal{L}}(t)$ from recent pairs $(t,\mathcal{L})$ and define the noise schedule $\alpha_{t}$ using the inverse CDF of $|d\hat{\mathcal{L}}/dt|$ ([dieleman2022continuousdiffusioncategoricaldata,](https://arxiv.org/html/2605.11125#bib.bib22)). We use $|d\hat{\mathcal{L}}/dt|$ as a proxy for the critical noise levels where the model learns most, and thus sample these levels more often. We stabilize the updates with an *exponential moving average* (EMA) of the successive schedules. The pseudocode and a comparison with InfoNoise are in Suppl. [B.2](https://arxiv.org/html/2605.11125#A2.SS2). In practice, the adaptive schedule does not slow training (Table [3](https://arxiv.org/html/2605.11125#A2.T3)).
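A simplified version of this procedure is sketched below to make the inverse-CDF step concrete. It is a stand-in rather than the pseudocode of Suppl. B.2: the binned, piecewise-linear loss fit, the number of bins, the floor `eps`, the EMA decay, and all function names are assumptions, and the interval is taken as $[0,1]$ instead of the truncated range.

```python
import numpy as np

def loss_slope_density(t, loss, num_bins=32, eps=1e-3):
    """Per-bin density proportional to |dL/dt|, estimated from (t, loss) pairs."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(t, edges) - 1, 0, num_bins - 1)
    sums = np.bincount(idx, weights=loss, minlength=num_bins)
    counts = np.bincount(idx, minlength=num_bins)
    filled = counts > 0
    # Mean loss per bin; empty bins are filled by linear interpolation.
    profile = np.interp(centers, centers[filled], sums[filled] / counts[filled])
    # eps keeps every bin reachable and the CDF strictly increasing.
    slope = np.abs(np.gradient(profile, centers)) + eps
    return slope / slope.sum()

def ema_update(old_density, new_density, decay=0.9):
    """Smooth successive schedules with an exponential moving average."""
    mixed = decay * old_density + (1.0 - decay) * new_density
    return mixed / mixed.sum()

def inverse_cdf(density, u):
    """Map uniform samples u in [0, 1] through the inverse CDF of the density."""
    edges = np.linspace(0.0, 1.0, density.size + 1)
    cdf = np.concatenate([[0.0], np.cumsum(density)])
    cdf[-1] = 1.0  # guard against round-off
    return np.interp(u, cdf, edges)
```

Drawing `u` uniformly and passing it through `inverse_cdf` concentrates samples where the fitted loss changes fastest, which is the behavior the adaptive schedule aims for.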

### 3.3 Hyperspherical Architecture

The latents $\mathbf{z}_{t}^{\ell}$ live on $\mathbb{S}^{d-1}$, and the sampling step ([15](https://arxiv.org/html/2605.11125#S3.E15)) rotates vectors. We propose a Transformer variant inspired by nGPT ([loshchilov2024ngptnormalizedtransformerrepresentation,](https://arxiv.org/html/2605.11125#bib.bib50)) that keeps the intermediate activations on $\mathbb{S}^{d-1}$ and parameterizes the attention and MLP layers as rotations. Each block replaces the additive residual with a normalized interpolation $\mathbf{h}_{\mathrm{out}}\leftarrow\mathrm{Norm}\bigl(\mathbf{h}_{\mathrm{in}}+\bm{\gamma}\odot(\mathrm{Norm}(\mathbf{h}_{\mathrm{layer}})-\mathbf{h}_{\mathrm{in}})\bigr)$, where $\mathbf{h}_{\mathrm{in}}$ is the input, $\mathbf{h}_{\mathrm{layer}}$ is the layer (MLP or attention) output, and $\mathbf{h}_{\mathrm{out}}$ is the updated state. The normalized interpolation approximates SLERP for small $\bm{\gamma}$. The per-dimension gate $\bm{\gamma}$ is computed from the noise level, similar to adaptive layernorm (adaLN) in DiT ([peebles2023scalablediffusionmodelstransformers,](https://arxiv.org/html/2605.11125#bib.bib66)). We do not use adaLN, so our architecture (*$\mathbb{S}$-arch*) has slightly *fewer* parameters than the standard DiT used in discrete diffusion papers ([sahoo2024simpleeffectivemaskeddiffusion,](https://arxiv.org/html/2605.11125#bib.bib76); [sahoo2025diffusionduality,](https://arxiv.org/html/2605.11125#bib.bib77); [sahoo2026scalingmaskeddiffusionlanguage,](https://arxiv.org/html/2605.11125#bib.bib78)). The $\mathbb{S}$-arch achieves better results than the standard DiT (Sec. [4](https://arxiv.org/html/2605.11125#S4)).
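The residual update is small enough to write out. The NumPy sketch below covers only the normalized interpolation itself; how $\bm{\gamma}$ is produced from the noise level is not modeled, and the constant gate value, shapes, and function names are assumptions for illustration.

```python
import numpy as np

def unit_norm(x, eps=1e-8):
    """Project vectors onto the unit sphere along the last axis."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def spherical_residual(h_in, h_layer, gamma):
    """h_out = Norm(h_in + gamma * (Norm(h_layer) - h_in)); gamma broadcasts per dimension."""
    return unit_norm(h_in + gamma * (unit_norm(h_layer) - h_in))

rng = np.random.default_rng(0)
h_in = unit_norm(rng.standard_normal((4, 16)))   # activations already on the sphere
h_layer = rng.standard_normal((4, 16))           # raw attention or MLP output
gamma = np.full(16, 0.1)                         # hypothetical gate value
h_out = spherical_residual(h_in, h_layer, gamma)
```

Setting the gate to 0 returns the input unchanged and setting it to 1 returns the normalized layer output, so the gate controls how far each block moves the state on the sphere.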

Takeaway. We propose the $\mathbb{S}$-arch to implement the denoiser $p^{\theta}_{1|t}$ in place of the standard DiT. We train with cross-entropy ([14](https://arxiv.org/html/2605.11125#S3.E14)) with a truncated, adaptive noise schedule. We marginalize the conditional velocities under $p^{\theta}_{1|t}$ to obtain $u^{\theta}_{t}$ ([15](https://arxiv.org/html/2605.11125#S3.E15)), integrate from $\mathbf{z}_{0}\sim p_{0}$ to $\mathbf{z}_{1}$, and decode with $\arg\max$. The velocity admits exact, stochastic, and top-$k$ variants. See Algo. [1](https://arxiv.org/html/2605.11125#alg1), Algo. [2](https://arxiv.org/html/2605.11125#alg2), and Algo. [3](https://arxiv.org/html/2605.11125#alg3) for pseudocode.

## 4 Experiments

We apply $\mathbb{S}$-FLM to Sudoku solving (Sec. [4.1](https://arxiv.org/html/2605.11125#S4.SS1)), math reasoning via code on TinyGSM ([liu2023tinygsm,](https://arxiv.org/html/2605.11125#bib.bib48)) (Sec. [4.2](https://arxiv.org/html/2605.11125#S4.SS2)), and unconditional language modeling on OpenWebText (OWT) ([Gokaslan2019OpenWeb,](https://arxiv.org/html/2605.11125#bib.bib30)) (Sec. [4.3](https://arxiv.org/html/2605.11125#S4.SS3)). The most common measure of sample quality in recent work on diffusion language models is *Generative Perplexity* (Gen. PPL), which is computed using a large AR model. However, samples with good likelihood are not necessarily correct at the sequence level ([velickovic2026perplexitycannotalwaystellright,](https://arxiv.org/html/2605.11125#bib.bib90); [feng2025theoreticalbenefitlimitationdiffusion,](https://arxiv.org/html/2605.11125#bib.bib26)), and repetitive text has low perplexity ([dieleman2022continuousdiffusioncategoricaldata,](https://arxiv.org/html/2605.11125#bib.bib22); [deschenaux2024promisesoutlookschallengesdiffusion,](https://arxiv.org/html/2605.11125#bib.bib18)). Therefore, we primarily focus on datasets with ground-truth solutions (Sudoku, GSM8K).

### 4.1 Reasoning on Sudoku

#### Experimental Setup

We compare $\mathbb{S}$-FLM with AR and recent diffusion models on Sudoku ([benhamu2025acceleratedsamplingmaskeddiffusion,](https://arxiv.org/html/2605.11125#bib.bib7); [kim2025finetuningmaskeddiffusionprovable,](https://arxiv.org/html/2605.11125#bib.bib42)). We use 48k training and 2k validation puzzles (no overlap), each with a unique solution ([alp2024sudoku,](https://arxiv.org/html/2605.11125#bib.bib3)). We define three difficulty levels, based on the number of visible digits (easy: 40/81, medium: 35/81, hard: 30/81). We use the modified *Diffusion Transformer* (DiT) ([peebles2023scalablediffusionmodelstransformers,](https://arxiv.org/html/2605.11125#bib.bib66)) architecture from SEDD ([lou2024discretediffusionmodelingestimating,](https://arxiv.org/html/2605.11125#bib.bib52)) with 8 layers and embedding dimension 512. We compare against AR, MDLM ([sahoo2024simpleeffectivemaskeddiffusion,](https://arxiv.org/html/2605.11125#bib.bib76)), Duo ([sahoo2025diffusionduality,](https://arxiv.org/html/2605.11125#bib.bib77)), CANDI ([pynadath2025candihybriddiscretecontinuousdiffusion,](https://arxiv.org/html/2605.11125#bib.bib69)), and FLM ([lee2026onesteplanguagemodelingcontinuous,](https://arxiv.org/html/2605.11125#bib.bib45)), using 180 sampling steps for the diffusion variants. We describe the input format and training hyperparameters in Suppl. [C.3](https://arxiv.org/html/2605.11125#A3.SS3).

#### Results

Table 1: Exact match accuracy (%) on Sudoku when sampling with 180 steps. The overall best is underlined. The best score with continuous diffusion is bolded. CANDI is a hybrid continuous-masked model, which can also be trained as a pure Gaussian diffusion model. $\mathbb{S}$-FLM is best, but similar to FLM.

Table [1](https://arxiv.org/html/2605.11125#S4.T1) shows the results. The autoregressive model performs poorly on all difficulties, since the task requires global context. Duo ([sahoo2025diffusionduality,](https://arxiv.org/html/2605.11125#bib.bib77)) obtains the highest accuracy. Among the continuous methods, FLM ([lee2026onesteplanguagemodelingcontinuous,](https://arxiv.org/html/2605.11125#bib.bib45)) and $\mathbb{S}$-FLM outperform AR and MDLM at every difficulty. $\mathbb{S}$-FLM with the simple linear schedule performs poorly on hard Sudokus, but with truncation and the adaptive schedule, $\mathbb{S}$-FLM performs similarly to the prior best continuous language models. See Suppl. [C.4](https://arxiv.org/html/2605.11125#A3.SS4) for the ablation over schedules and embedding re-projection. We do not re-project embeddings to $\mathbb{S}^{d-1}$ after each optimizer step in Table [1](https://arxiv.org/html/2605.11125#S4.T1).

![Refer to caption](https://arxiv.org/html/2605.11125v1/x4.png)
![Refer to caption](https://arxiv.org/html/2605.11125v1/x5.png)

Figure 3: Accuracy on GSM8K with $T=0.1$. Left: Decoding strategies for $\mathbb{S}$-FLM ($\mathbb{S}$-arch). At low temperature, sampling with the exact or stochastic velocities approaches the accuracy of top-$1$ decoding. Right: At $T=0.1$ the standard DiT and the $\mathbb{S}$-arch perform similarly, and their accuracy is roughly half that of Duo. At $T=1$ the $\mathbb{S}$-arch outperforms the standard DiT (Figure [6](https://arxiv.org/html/2605.11125#A3.F6)).

### 4.2 Reasoning on GSM8K

#### Experimental Setup

TinyGSM ([liu2023tinygsm,](https://arxiv.org/html/2605.11125#bib.bib48)) is a dataset of $\sim$11.8M synthetic math word problems similar to GSM8K ([cobbe2021trainingverifierssolvemath,](https://arxiv.org/html/2605.11125#bib.bib14)). Each solution is a GPT-3.5-generated Python program that produces the numerical answer. We compare our models on the GSM8K test set, executing one generated solution per problem. We plot error bars corresponding to 95% bootstrapped percentile confidence intervals on the test set (Suppl. [C.5](https://arxiv.org/html/2605.11125#A3.SS5)). We use the SmolLM-135M tokenizer ([allal2025smollm2smolgoesbig,](https://arxiv.org/html/2605.11125#bib.bib2)) because it compresses the code better than the GPT-2 tokenizer ([Radford2019LanguageMA,](https://arxiv.org/html/2605.11125#bib.bib71)) (Figure [5](https://arxiv.org/html/2605.11125#A3.F5)). We pad the input sequences to length 512 and compute the loss on padding tokens. As for Sudoku, we always keep the problem statement clean, so that we can sample conditionally. All backbones use hidden dimension 768, 12 layers, 12 attention heads, and dropout 0.1. For $\mathbb{S}$-FLM, we train both the standard DiT and the $\mathbb{S}$-arch (Sec. [3.3](https://arxiv.org/html/2605.11125#S3.SS3)). We train for 250k steps with batch size 512, using Adam ($\text{lr}=3\times 10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, no weight decay) and an EMA rate of $0.9999$.
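A percentile bootstrap over per-problem correctness flags can be sketched as follows. The number of resamples, the seed, and the function name are assumptions; the exact protocol used for the error bars is the one described in Suppl. C.5.

```python
import numpy as np

def bootstrap_accuracy_ci(correct, num_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    `correct` holds one 0/1 flag per test problem.
    """
    correct = np.asarray(correct, dtype=np.float64)
    rng = np.random.default_rng(seed)
    n = correct.size
    # Resample problems with replacement and recompute the accuracy each time.
    idx = rng.integers(0, n, size=(num_resamples, n))
    accuracies = correct[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    low, high = np.quantile(accuracies, [tail, 1.0 - tail])
    return correct.mean(), low, high
```

The returned `low` and `high` are the lower and upper ends of the error bar around the observed accuracy.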

#### Results

We report the accuracy at $T=1$ with 1024 sampling steps for diffusion variants in Table [2](https://arxiv.org/html/2605.11125#S4.T2). The accuracy of CANDI ([pynadath2025candihybriddiscretecontinuousdiffusion,](https://arxiv.org/html/2605.11125#bib.bib69)) and FLM ([lee2026onesteplanguagemodelingcontinuous,](https://arxiv.org/html/2605.11125#bib.bib45)) is below $1\%$ despite using their respective optimized schedules. Temperature annealing did not improve the accuracy above $0.5\%$. In contrast, $\mathbb{S}$-FLM solves $18\%$ of problems with top-$1$ decoding. In Table [2](https://arxiv.org/html/2605.11125#S4.T2), we ablate the impact of each new element of $\mathbb{S}$-FLM. With the vanilla linear schedule, $\mathbb{S}$-FLM has an accuracy of $1.2\%$.

Table 2: Accuracy (%) on GSM8K after 250k training steps. Diffusion variants use 1024 steps and no temperature scaling; AR uses 512 steps (the context length). The overall best is underlined, the best continuous diffusion result is bolded. $\mathbb{S}$-FLM outperforms prior continuous diffusion language models, and closes the gap to MDLM at $T=1$ when the velocity ([15](https://arxiv.org/html/2605.11125#S3.E15)) is computed with top-$k=1$. A gap remains at low-temperature ($T=0.1$) decoding, where MDLM and Duo reach $33$–$36\%$ (Figure [3](https://arxiv.org/html/2605.11125#S4.F3)).

Truncating the noise schedule according to ([16](https://arxiv.org/html/2605.11125#S3.E16)) improves the accuracy to 7.7%. *Not* re-normalizing the embeddings after each optimization step (Suppl. [B.5](https://arxiv.org/html/2605.11125#A2.SS5)) increases it to 8.1%. Note that the denoiser always processes unit-norm vectors in the forward pass, but the embedding table itself can either be re-projected to $\mathbb{S}^{d-1}$ after every optimization step or left unconstrained. Skipping the re-projection is equivalent to annealing the learning rate for the embedding vectors (Suppl. [B.5](https://arxiv.org/html/2605.11125#A2.SS5)), which might stabilize training since the velocity ([15](https://arxiv.org/html/2605.11125#S3.E15)) is a function of the embeddings. The adaptive noise schedule improves the accuracy to 11.1% and the $\mathbb{S}$-arch to 12.4%. With top-$k=1$ velocity decoding ([15](https://arxiv.org/html/2605.11125#S3.E15)), $\mathbb{S}$-FLM closes the gap to MDLM and Duo at $T=1$. A gap remains at low-temperature ($T=0.1$) decoding, where MDLM and Duo reach $33$–$36\%$ (Figure [3](https://arxiv.org/html/2605.11125#S4.F3)).
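One way to build intuition for the learning-rate remark is to look at the gradient that reaches an unconstrained embedding when the forward pass only sees its normalized version. The sketch below uses plain gradient steps with random stand-in gradients; it may differ in detail from the argument in Suppl. B.5 and from training with Adam, and the step size and names are hypothetical.

```python
import numpy as np

def grad_wrt_unnormalized(e, g_hat):
    """Chain rule through e_hat = e / ||e||: (I - e_hat e_hat^T) g_hat / ||e||."""
    norm = np.linalg.norm(e)
    e_hat = e / norm
    return (g_hat - e_hat * np.dot(e_hat, g_hat)) / norm

rng = np.random.default_rng(0)
e = rng.standard_normal(64)
e /= np.linalg.norm(e)                  # start on the unit sphere
lr = 0.5                                # hypothetical step size
norms, angular_steps = [], []
for _ in range(20):
    g_hat = rng.standard_normal(64)     # stand-in for dLoss/d(e_hat)
    new_e = e - lr * grad_wrt_unnormalized(e, g_hat)   # no re-projection
    cos = np.dot(e, new_e) / (np.linalg.norm(e) * np.linalg.norm(new_e))
    angular_steps.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    e = new_e
    norms.append(np.linalg.norm(e))
```

Because the gradient is orthogonal to the embedding, each step increases its norm, and because the gradient scales as one over the norm, the rotation per step shrinks over time, which behaves like a decaying learning rate on the direction of the embedding.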

#### Ablations

Figure [1](https://arxiv.org/html/2605.11125#S0.F1) shows that at $T=1$, with NFE $\leq 16$, $\mathbb{S}$-FLM outperforms MDLM and Duo. All methods benefit from greedy decoding (Sec. [3.1](https://arxiv.org/html/2605.11125#S3.SS1.SSS0.Px3)); therefore, we compare $\mathbb{S}$-FLM with MDLM and Duo at low temperature ($T=0.1$). At $T=0.1$, MDLM reaches $33\%$ and Duo $36\%$, while $\mathbb{S}$-FLM is around $18\%$ (Figure [3](https://arxiv.org/html/2605.11125#S4.F3)). Thus, while $\mathbb{S}$-FLM outperforms previous continuous approaches, there is a large performance gap at low temperature compared to discrete diffusion and AR. Figure [8](https://arxiv.org/html/2605.11125#A3.F8) shows that at $T=1$, the only $k$ for which top-$k$ decoding significantly improves the accuracy is $k=1$. In addition, top-$1$ decoding beats the low-temperature and the stochastic samplers (Figure [9](https://arxiv.org/html/2605.11125#A3.F9)). See Suppl. [C.7](https://arxiv.org/html/2605.11125#A3.SS7) for more details.

### 4.3 Language Modeling on OpenWebText

![Refer to caption](https://arxiv.org/html/2605.11125v1/x6.png)
![Refer to caption](https://arxiv.org/html/2605.11125v1/x7.png)

Figure 4: Gen. PPL (↓) / Entropy (↑) frontier on OpenWebText at NFE $=32$ (left) and NFE $=1024$ (right). Each curve is obtained by sweeping over the temperature $T$. $\mathbb{S}$-FLM with the $\mathbb{S}$-arch performs similarly to prior FLMs. Duo is best overall. At NFE $=32$, the frontier of FLM is highly unstable.

#### Experimental Setup

Following MDLM ([sahoo2024simpleeffectivemaskeddiffusion,](https://arxiv.org/html/2605.11125#bib.bib76)), we train $\mathbb{S}$-FLM on OpenWebText using the GPT-2 tokenizer and a context length of 1024. The model is a 12-layer standard DiT backbone with hidden dimension 768 and 12 attention heads. We train for 1M steps with the same optimizer configuration as for TinyGSM (Sec. [4.2](https://arxiv.org/html/2605.11125#S4.SS2)). For FLM ([lee2026onesteplanguagemodelingcontinuous,](https://arxiv.org/html/2605.11125#bib.bib45)), we use the original checkpoint released by the authors.

#### Gen\. PPL / Entropy Frontier

Repetitive generations can improve the Gen. PPL without improving sample quality ([velickovic2026perplexitycannotalwaystellright,](https://arxiv.org/html/2605.11125#bib.bib90)), and different models might have different optimal sampling temperatures. We therefore evaluate the Gen. PPL and the average unigram entropy across $T\in\{0.70,0.75,\ldots,1.20\}$ and plot the frontier for each model and NFE (details in Suppl. [C.8](https://arxiv.org/html/2605.11125#A3.SS8)) ([pynadath2025candihybriddiscretecontinuousdiffusion,](https://arxiv.org/html/2605.11125#bib.bib69)). At NFE $=1024$ (Figure 4, right), $\mathbb{S}$-FLM with the $\mathbb{S}$-arch matches prior FLMs in Gen. PPL, while the standard DiT is slightly weaker. At low NFE, the frontier becomes unstable for prior FLMs, but remains stable for $\mathbb{S}$-FLM, which has a Gen. PPL similar to that of MDLM. Duo generally achieves the best Gen. PPL / Entropy trade-off. See Suppl. [C.9](https://arxiv.org/html/2605.11125#A3.SS9) for the complete sweep of NFEs.
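The entropy axis of the frontier can be illustrated with a small helper that measures the unigram entropy of each generated token sequence and averages over samples. The choice of nats as the unit and the function names are assumptions; the exact evaluation details are those of Suppl. C.8.

```python
import math
from collections import Counter

def unigram_entropy(token_ids):
    """Entropy (in nats) of the empirical token distribution of one non-empty sequence."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def average_unigram_entropy(samples):
    """Mean per-sequence unigram entropy over a non-empty list of samples."""
    return sum(unigram_entropy(s) for s in samples) / len(samples)
```

A sequence that repeats a single token has entropy 0, which is why a low Gen. PPL paired with low entropy signals degenerate, repetitive samples.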

## 5 Related Work

$\mathbb{S}$-FLM differs from prior work in three ways: it operates in continuous rather than discrete space, defines its flow on the hypersphere $\mathbb{S}^{d-1}$, and learns token embeddings end-to-end.

#### Continuous diffusion for language modeling

Prior work applies Gaussian diffusion to embeddings and trains end-to-end with cross-entropy ([li2022diffusionlmimprovescontrollabletext,](https://arxiv.org/html/2605.11125#bib.bib46); [gulrajani2023likelihoodbaseddiffusionlanguagemodels,](https://arxiv.org/html/2605.11125#bib.bib33)), or regresses onto pre-trained embeddings ([strudel2022selfconditionedembeddingdiffusion,](https://arxiv.org/html/2605.11125#bib.bib86); [lovelace2022latentdiffusionlanguagegeneration,](https://arxiv.org/html/2605.11125#bib.bib54); [shen2026codarcontinuousdiffusionlanguage,](https://arxiv.org/html/2605.11125#bib.bib81)), which caps sample quality at the pre-trained embeddings and requires two training stages. Riemannian flow models extend score-based generative modeling to manifolds ([mathieu2020riemanniancnf,](https://arxiv.org/html/2605.11125#bib.bib56); [debortoli2022riemannianscorebased,](https://arxiv.org/html/2605.11125#bib.bib8); [lou2023scalingriemanniandiffusion,](https://arxiv.org/html/2605.11125#bib.bib53)), but generally assume data already lie on the manifold. $\mathbb{S}$-FLM instead learns the embeddings and the velocity on $\mathbb{S}^{d-1}$ jointly and injects noise via rotations.

#### Flow language models \(FLMs\)

A recent line of work treats language modeling as flow matching on continuous representations. Several recent works ([sahoo2025diffusionduality,](https://arxiv.org/html/2605.11125#bib.bib77); [lee2026onesteplanguagemodelingcontinuous,](https://arxiv.org/html/2605.11125#bib.bib45); [roos2026categoricalflowmaps,](https://arxiv.org/html/2605.11125#bib.bib75); [potaptchik2026discreteflowmaps,](https://arxiv.org/html/2605.11125#bib.bib68)) add Gaussian noise to one-hot or simplex representations and decode via $\arg\max$. These approaches materialize dense $L\times|\mathcal{V}|$ arrays at training and sampling time. Fisher-Flow ([davis2024fisherflowmatching,](https://arxiv.org/html/2605.11125#bib.bib17)) maps one-hot vectors to the positive orthant of $\mathbb{S}^{d-1}$ via the Fisher–Rao metric, but this approach does not scale well to language modeling with large vocabularies ([jo2025continuousdiffusionmodellanguage,](https://arxiv.org/html/2605.11125#bib.bib37)). $\mathbb{S}$-FLM operates on $d$-dimensional token embeddings on $\mathbb{S}^{d-1}$, learned end-to-end, which avoids the $|\mathcal{V}|$-dimensional bottleneck and trains faster (Table [3](https://arxiv.org/html/2605.11125#A2.T3)).

#### Representation learning on the hypersphere

Hyperspherical representations are common in contrastive learning, where uniform spread on $\mathbb{S}^{d-1}$ correlates with strong downstream performance ([wang2020understandingcontrastiverepresentationlearning,](https://arxiv.org/html/2605.11125#bib.bib94)), and cosine distance outperforms Euclidean distance for comparing word embeddings ([mikolov2013efficientestimationwordrepresentations,](https://arxiv.org/html/2605.11125#bib.bib58); [pennington2014glove,](https://arxiv.org/html/2605.11125#bib.bib67)) and for retrieval ([reimers2019sentencebertsentenceembeddings,](https://arxiv.org/html/2605.11125#bib.bib74); [karpukhin2020densepassageretrieval,](https://arxiv.org/html/2605.11125#bib.bib40)). A latent prior on $\mathbb{S}^{d-1}$ also stabilizes Variational Autoencoders ([davidson2018hypersphericalvariationalautoencoders,](https://arxiv.org/html/2605.11125#bib.bib16); [xu2018sphericallatentspacesvae,](https://arxiv.org/html/2605.11125#bib.bib96)), and normalizing activations and weights to $\mathbb{S}^{d-1}$ improves the stability of AR models ([loshchilov2024ngptnormalizedtransformerrepresentation,](https://arxiv.org/html/2605.11125#bib.bib50)). Suppl. [D](https://arxiv.org/html/2605.11125#A4) gives a fuller discussion.

## 6 Conclusion

We introduced $\mathbb{S}$-FLM, a Riemannian flow on $\mathbb{S}^{d-1}$ that is competitive at language modeling with large vocabularies and learns the velocity and embeddings jointly. Beyond the formalism, our key contributions include the $\mathbb{S}$-arch backbone with normalized activations and the truncated and adaptive noise-schedule analysis. On Sudoku, $\mathbb{S}$-FLM performs similarly to prior FLMs. On GSM8K, where other FLMs fail, $\mathbb{S}$-FLM with top-$1$ decoding closes the gap to MDLM and Duo at $T=1$, though a gap remains at low-temperature ($T=0.1$) sampling. On OpenWebText, $\mathbb{S}$-FLM follows the Gen. PPL / Entropy frontier of prior FLMs at high NFE and beats them at low NFE, where prior continuous models have an unstable frontier. $\mathbb{S}$-FLM avoids materializing $|\mathcal{V}|$-dimensional one-hot vectors, so it trains faster than prior FLMs and may be easier to scale (though this is speculation and left for future work). Our analysis of the sampling dynamics provides a principled heuristic for truncating the noise schedule that works well in practice. At the same time, there remains a clear gap between diffusion models and AR models on GSM8K.

## 7 Limitations

Continuous diffusion language models, including $\mathbb{S}$-FLM, underperform their discrete counterparts. On TinyGSM, $\mathbb{S}$-FLM reduces the gap compared to prior FLMs but does not eliminate it, and a substantial gap to autoregressive decoding remains (Sec. [4.2](https://arxiv.org/html/2605.11125#S4.SS2)). The truncation threshold $\alpha^{\star}(\delta)$ is derived from a simplified model (Suppl. [C.2](https://arxiv.org/html/2605.11125#A3.SS2)), which assumes that the embeddings are i.i.d. uniformly distributed on the sphere. The learned embeddings are likely more structured. Therefore, $\alpha^{\star}(\delta)$ should only be treated as a principled heuristic for the truncation hyperparameter. A more sophisticated model of the sampling dynamics might improve the bound. The $\mathbb{S}$-arch follows the design of nGPT for simplicity and trains slower than the standard DiT. Improving its throughput is an important future direction. The training dynamics of $\mathbb{S}$-FLM should also be studied further, and the results of contrastive representation learning may be relevant. We train a single model per configuration. Training once on TinyGSM costs more than \$300, so we could not afford to train several copies with the same hyperparameters.

## 8 Impact Statement

This work advances research on continuous diffusion language models. Like any language modeling research, it carries the standard risks of misuse for misinformation or harmful content. Our models are trained at small scale on Sudoku, TinyGSM (synthetic math word problems), and OpenWebText, and remain far below the capabilities of state-of-the-art autoregressive language models. Anyone seeking to cause harm has stronger publicly available models at their disposal. The contribution is methodological.

## References

- \[1\]Michael S\. Albergo and Eric Vanden\-Eijnden\.Building normalizing flows with stochastic interpolants, 2023\.
- \[2\]Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan\-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, and Thomas Wolf\.Smollm2: When smol goes big – data\-centric training of a small language model, 2025\.
- \[3\]Ali Alp\.Sudoku puzzle generator\.[https://github\.com/alicommit\-malp/sudoku](https://github.com/alicommit-malp/sudoku), 2024\.
- \[4\]Marianne Arriola, Aaron Gokaslan, Justin T\. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov\.Block diffusion: Interpolating between autoregressive and diffusion language models, 2025\.
- \[5\]Jacob Austin, Daniel D\. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg\.Structured denoising diffusion models in discrete state\-spaces, 2023\.
- \[6\]Richard Bellman\.Adaptive Control Processes: A Guided Tour\.Princeton University Press, Princeton, NJ, 1961\.
- \[7\]Heli Ben\-Hamu, Itai Gat, Daniel Severo, Niklas Nolte, and Brian Karrer\.Accelerated sampling from masked diffusion models via entropy bounded unmasking, 2025\.
- \[8\]Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, and Arnaud Doucet\.Riemannian score\-based generative modelling, 2022\.
- \[9\]Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, and Arnaud Doucet\.A continuous time framework for discrete denoising models, 2022\.
- \[10\]Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola\.Generative flows on discrete state\-spaces: Enabling multimodal flows with applications to protein co\-design, 2024\.
- \[11\]Ricky T\. Q\. Chen and Yaron Lipman\.Flow matching on general geometries, 2024\.
- \[12\]Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Kristjanson Duvenaud\.Neural ordinary differential equations\.NeurIPS 2018, abs/1806\.07366, 2018\.
- \[13\]Ting Chen, Ruixiang Zhang, and Geoffrey Hinton\.Analog bits: Generating discrete data using diffusion models with self\-conditioning, 2022\.
- \[14\]Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\.Training verifiers to solve math word problems, 2021\.
- \[15\]Earl A\. Coddington and Norman Levinson\.Theory of Ordinary Differential Equations\.McGraw\-Hill, New York, 1955\.
- \[16\]Tim R\. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M\. Tomczak\.Hyperspherical variational auto\-encoders, 2018\.
- \[17\]Oscar Davis, Samuel Kessler, Mircea Petrache, İsmail İlkan Ceylan, Michael Bronstein, and Avishek Joey Bose\.Fisher flow matching for generative modeling over discrete data, 2024\.
- \[18\]Justin Deschenaux and Caglar Gulcehre\.Promises, outlooks and challenges of diffusion language modeling, 2024\.
- \[19\]Justin Deschenaux, Caglar Gulcehre, and Subham Sekhar Sahoo\.The diffusion duality, chapter ii:ψ\\psi\-samplers and efficient curriculum, 2026\.
- \[20\]Justin Deschenaux, Lan Tran, and Caglar Gulcehre\.Partition generative modeling: Masked modeling without masks, 2025\.
- \[21\]Prafulla Dhariwal and Alexander Nichol\.Diffusion models beat gans on image synthesis\.Advances in neural information processing systems, 34:8780–8794, 2021\.
- \[22\]Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H\. Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, and Jonas Adler\.Continuous diffusion for categorical data, 2022\.
- \[23\]Manfredo Perdigão do Carmo\.Riemannian Geometry\.Mathematics: Theory & Applications\. Birkhäuser Boston, 1992\.
- \[24\]Bradley Efron\.Bootstrap methods: Another look at the jackknife\.The Annals of Statistics, 7\(1\):1–26, 1979\.
- \[25\]Floor Eijkelboom, Grigory Bartosh, Christian Andersson Naesseth, Max Welling, and Jan\-Willem van de Meent\.Variational flow matching for graph generation, 2025\.
- \[26\]Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, and Di He\.Theoretical benefit and limitation of diffusion language model, 2025\.
- \[27\]F\. N\. Fritsch and J\. Butland\.A method for constructing local monotone piecewise cubic interpolants\.SIAM Journal on Scientific and Statistical Computing, 5\(2\):300–304, 1984\.
- \[28\]Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T\. Q\. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman\.Discrete flow matching, 2024\.
- \[29\]Gemini Team, Rohan Anil, Sebastian Borgeaud, et al\.Gemini: A family of highly capable multimodal models, 2025\.
- \[30\]Aaron Gokaslan and Vanya Cohen\.Openwebtext corpus\.[http://Skylion007\.github\.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus), 2019\.
- \[31\]Google\.Gemma 3 technical report, 2025\.
- \[32\]Will Grathwohl, Ricky T\. Q\. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud\.Ffjord: Free\-form continuous dynamics for scalable reversible generative models, 2018\.
- \[33\]Ishaan Gulrajani and Tatsunori B\. Hashimoto\.Likelihood\-based diffusion language models, 2023\.
- \[34\]Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov\.Ssd\-lm: Semi\-autoregressive simplex\-based diffusion language model for text generation and modular control, 2023\.
- \[35\]Jonathan Ho, Ajay Jain, and Pieter Abbeel\.Denoising diffusion probabilistic models, 2020\.
- \[36\]Jonathan Ho and Tim Salimans\.Classifier\-free diffusion guidance, 2022\.
- \[37\]Jaehyeong Jo and Sung Ju Hwang\.Continuous diffusion model for language modeling, 2025\.
- \[38\]Mingyu Jo, Jaesik Yoon, Justin Deschenaux, Caglar Gulcehre, and Sungjin Ahn\.Loopholing discrete diffusion: Deterministic bypass of the sampling wall, 2025\.
- \[39\]Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B\. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei\.Scaling laws for neural language models, 2020\.
- \[40\]Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih\.Dense passage retrieval for open\-domain question answering, 2020\.
- \[41\]Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine\.Analyzing and improving the training dynamics of diffusion models, 2024\.
- \[42\]Jaeyeon Kim, Seunggeun Kim, Taekyun Lee, David Z\. Pan, Hyeji Kim, Sham Kakade, and Sitan Chen\.Fine\-tuning masked diffusion for provable self\-correction, 2025\.
- \[43\]Diederik P\. Kingma and Jimmy Ba\.Adam: A method for stochastic optimization, 2017\.
- \[44\]Ouail Kitouni, Niklas Nolte, Diane Bouchacourt, Adina Williams, Mike Rabbat, and Mark Ibrahim\.The factorization curse: Which tokens you predict underlie the reversal curse and more, 2024\.
- \[45\]Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M\. Boffi, and Jinwoo Kim\.One\-step language modeling via continuous denoising, 2026\.
- \[46\]Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B\. Hashimoto\.Diffusion\-lm improves controllable text generation, 2022\.
- \[47\]Yaron Lipman, Ricky T\. Q\. Chen, Heli Ben\-Hamu, Maximilian Nickel, and Matt Le\.Flow matching for generative modeling, 2023\.
- \[48\]Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang\.TinyGSM: Achieving \>80% on GSM8k with small language models, 2023\.
- \[49\]Xingchao Liu, Chengyue Gong, and Qiang Liu\.Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022\.
- \[50\]Ilya Loshchilov, Cheng\-Ping Hsieh, Simeng Sun, and Boris Ginsburg\.ngpt: Normalized transformer with representation learning on the hypersphere, 2024\.
- \[51\]Aaron Lou, Derek Lim, Isay Katsman, Leo Huang, Qingxuan Jiang, Ser\-Nam Lim, and Christopher De Sa\.Neural manifold ordinary differential equations, 2020\.
- \[52\]Aaron Lou, Chenlin Meng, and Stefano Ermon\.Discrete diffusion modeling by estimating the ratios of the data distribution, 2024\.
- \[53\]Aaron Lou, Minkai Xu, and Stefano Ermon\.Scaling riemannian diffusion models, 2023\.
- \[54\]Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q\. Weinberger\.Latent diffusion for language generation, 2022\.
- \[55\]Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E\. Peters, and Arman Cohan\.TESS: Text\-to\-text self\-conditioned simplex diffusion, 2023\.
- \[56\]Emile Mathieu and Maximilian Nickel\.Riemannian continuous normalizing flows, 2020\.
- \[57\]Meta\.The llama 3 herd of models, 2024\.
- \[58\]Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean\.Efficient estimation of word representations in vector space, 2013\.
- \[59\]Mervin E\. Muller\.A note on a method for generating points uniformly on n\-dimensional spheres\.Communications of the ACM, 2\(4\):19–20, 1959\.
- \[60\]Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, and Aditi Raghunathan\.Roll the dice & look before you leap: Going beyond the creative limits of next\-token prediction\.arXiv preprint arXiv:2504\.15266, 2025\.
- \[61\]Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li\.Scaling up masked diffusion models on text, 2025\.
- \[62\]Hunter Nisonoff, Junhao Xiong, Stephan Allenspach, and Jennifer Listgarten\.Unlocking guidance for discrete state\-space diffusion and flow models\.arXiv preprint arXiv:2406\.01572, 2024\.
- \[63\]OpenAI\.Gpt\-4 technical report, 2024\.
- \[64\]OpenAI\.Gpt\-oss: open\-weight language models by openai\.[https://github\.com/openai/gpt\-oss](https://github.com/openai/gpt-oss), 2024\.GitHub repository\.
- \[65\]Vassilis Papadopoulos, Jérémie Wenger, and Clément Hongler\.Arrows of time for large language models, 2024\.
- \[66\]William Peebles and Saining Xie\.Scalable diffusion models with transformers, 2023\.
- \[67\]Jeffrey Pennington, Richard Socher, and Christopher Manning\.GloVe: Global vectors for word representation\.In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pages 1532–1543\. Association for Computational Linguistics, 2014\.
- \[68\]Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden\-Eijnden, and Michael S\. Albergo\.Discrete flow maps, 2026\.
- \[69\]Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang\.Candi: Hybrid discrete\-continuous diffusion models, 2025\.
- \[70\]Qwen Team\.Qwen2\.5 technical report, 2025\.
- \[71\]Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever\.Language models are unsupervised multitask learners, 2019\.
- \[72\]Sebastian Raschka\.Creating confidence intervals for machine learning classifiers\.[https://sebastianraschka\.com/blog/2022/confidence\-intervals\-for\-ml\.html](https://sebastianraschka.com/blog/2022/confidence-intervals-for-ml.html), 2022\.Accessed 2026\-05\-05\.
- \[73\]Gabriel Raya, Bac Nguyen, Georgios Batzolis, Yuhta Takida, Dejan Stancevic, Naoki Murata, Chieh\-Hsin Lai, Yuki Mitsufuji, and Luca Ambrogioni\.Information\-guided noise allocation for efficient diffusion training, 2026\.
- \[74\]Nils Reimers and Iryna Gurevych\.Sentence\-bert: Sentence embeddings using siamese bert\-networks, 2019\.
- \[75\]Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, and Jan\-Willem van de Meent\.Categorical flow maps, 2026\.
- \[76\]Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov\.Simple and effective masked diffusion language models, 2024\.
- \[77\]Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov\.The diffusion duality, 2025\.
- \[78\]Subham Sekhar Sahoo, Jean\-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic\.Scaling beyond masked diffusion language models, 2026\.
- \[79\]Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, and Arash Vahdat\.Esoteric language models, 2025\.
- \[80\]Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla\-torre, Bernardo P\. de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov\.Simple guidance mechanisms for discrete diffusion models, 2025\.
- \[81\]Junzhe Shen, Jieru Zhao, Ziwei He, and Zhouhan Lin\.Codar: Continuous diffusion language models are more powerful than you think, 2026\.
- \[82\]Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K\. Titsias\.Simplified and generalized masked diffusion for discrete data, 2025\.
- \[83\]Ken Shoemake\.Animating rotation with quaternion curves\.In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’85, New York, NY, USA, 1985\. Association for Computing Machinery\.
- \[84\]Jascha Sohl\-Dickstein, Eric A\. Weiss, Niru Maheswaranathan, and Surya Ganguli\.Deep unsupervised learning using nonequilibrium thermodynamics, 2015\.
- \[85\]Hannes Stark, Bowen Jing, Chenyu Wang, Gabriele Corso, Bonnie Berger, Regina Barzilay, and Tommi Jaakkola\.Dirichlet flow matching with applications to dna sequence design, 2024\.
- \[86\]Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, and Rémi Leblond\.Self\-conditioned embedding diffusion for text generation, 2022\.
- \[87\]Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu\.Roformer: Enhanced transformer with rotary position embedding, 2023\.
- \[88\]Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector\-Brooks, Guy Wolf, and Yoshua Bengio\.Improving and generalizing flow\-based generative models with minibatch optimal transport, 2024\.
- \[89\]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N\. Gomez, Lukasz Kaiser, and Illia Polosukhin\.Attention is all you need, 2023\.
- \[90\]Petar Veličković, Federico Barbero, Christos Perivolaropoulos, Simon Osindero, and Razvan Pascanu\.Perplexity cannot always tell right from wrong, 2026\.
- \[91\]Roman Vershynin\.High\-Dimensional Probability: An Introduction with Applications in Data Science\.Number 47 in Cambridge Series in Statistical and Probabilistic Mathematics\. Cambridge University Press, 2018\.
- \[92\]Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann\.Generalized interpolating discrete diffusion, 2025\.
- \[93\]Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, and Antonio Orvieto\.Scaling behavior of discrete diffusion language models, 2026\.
- \[94\]Tongzhou Wang and Phillip Isola\.Understanding contrastive representation learning through alignment and uniformity on the hypersphere, 2020\.
- \[95\]Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie\.Fast\-dllm: Training\-free acceleration of diffusion llm by enabling kv cache and parallel decoding, 2025\.
- \[96\]Jiacheng Xu and Greg Durrett\.Spherical latent spaces for stable variational autoencoders, 2018\.
- \[97\]Olga Zaghen, Floor Eijkelboom, Alison Pouplin, Cong Liu, Max Welling, Jan\-Willem van de Meent, and Erik J\. Bekkers\.Riemannian variational flow matching for material and protein design, 2026\.
- \[98\]Daniel Zhang\-Li, Nianyi Lin, Jifan Yu, Zheyuan Zhang, Zijun Yao, Xiaokang Zhang, Lei Hou, Jing Zhang, and Juanzi Li\.Reverse that number\! decoding order matters in arithmetic learning, 2024\.

###### Contents

- [1 Introduction](https://arxiv.org/html/2605.11125#S1)
- [2 Background](https://arxiv.org/html/2605.11125#S2)
  - [2.1 Flow Generative Modeling](https://arxiv.org/html/2605.11125#S2.SS1)
  - [2.2 Geometry of the Hypersphere](https://arxiv.org/html/2605.11125#S2.SS2)
  - [2.3 Flow Matching on the Hypersphere](https://arxiv.org/html/2605.11125#S2.SS3)
- [3 Hyperspherical Flow Language Models](https://arxiv.org/html/2605.11125#S3)
  - [3.1 Sampling](https://arxiv.org/html/2605.11125#S3.SS1)
  - [3.2 Noise Schedule](https://arxiv.org/html/2605.11125#S3.SS2)
  - [3.3 Hyperspherical Architecture](https://arxiv.org/html/2605.11125#S3.SS3)
- [4 Experiments](https://arxiv.org/html/2605.11125#S4)
  - [4.1 Reasoning on Sudoku](https://arxiv.org/html/2605.11125#S4.SS1)
  - [4.2 Reasoning on GSM8K](https://arxiv.org/html/2605.11125#S4.SS2)
  - [4.3 Language Modeling on OpenWebText](https://arxiv.org/html/2605.11125#S4.SS3)
- [5 Related Work](https://arxiv.org/html/2605.11125#S5)
- [6 Conclusion](https://arxiv.org/html/2605.11125#S6)
- [7 Limitations](https://arxiv.org/html/2605.11125#S7)
- [8 Impact Statement](https://arxiv.org/html/2605.11125#S8)
- [References](https://arxiv.org/html/2605.11125#bib)
- [A Background on Flow Matching](https://arxiv.org/html/2605.11125#A1)
  - [A.1 The Continuity Equation](https://arxiv.org/html/2605.11125#A1.SS1)
  - [A.2 Deriving the Conditional Velocity Fields](https://arxiv.org/html/2605.11125#A1.SS2)
- [B Additional Details](https://arxiv.org/html/2605.11125#A2)
  - [B.1 Training and Sampling Pseudocode](https://arxiv.org/html/2605.11125#A2.SS1)
  - [B.2 Adaptive Noise Schedule](https://arxiv.org/html/2605.11125#A2.SS2)
  - [B.3 Hyperspherical Backbone Architecture](https://arxiv.org/html/2605.11125#A2.SS3)
  - [B.4 Step Size with the Euler Sampler](https://arxiv.org/html/2605.11125#A2.SS4)
  - [B.5 Re-Normalizing Embeddings](https://arxiv.org/html/2605.11125#A2.SS5)
  - [B.6 Training Cost](https://arxiv.org/html/2605.11125#A2.SS6)
- [C Additional Results](https://arxiv.org/html/2605.11125#A3)
  - [C.1 Sequence Length Using Different Tokenizers](https://arxiv.org/html/2605.11125#A3.SS1)
  - [C.2 Analysis under the Random Codebook Model](https://arxiv.org/html/2605.11125#A3.SS2)
  - [C.3 Sudoku Setup](https://arxiv.org/html/2605.11125#A3.SS3)
  - [C.4 Additional Ablations on Sudoku](https://arxiv.org/html/2605.11125#A3.SS4)
  - [C.5 Bootstrap Confidence Intervals on TinyGSM](https://arxiv.org/html/2605.11125#A3.SS5)
  - [C.6 Additional Ablations on TinyGSM](https://arxiv.org/html/2605.11125#A3.SS6)
  - [C.7 Additional Sampling Results on TinyGSM](https://arxiv.org/html/2605.11125#A3.SS7)
  - [C.8 Gen. PPL / Entropy Frontier Protocol](https://arxiv.org/html/2605.11125#A3.SS8)
  - [C.9 Additional Results on OpenWebText](https://arxiv.org/html/2605.11125#A3.SS9)
- [D Extended Related Work](https://arxiv.org/html/2605.11125#A4)

## Appendix A Background on Flow Matching

### A.1 The Continuity Equation

The velocity field $u_t$ and the density $p_t$ are related locally through the *continuity equation*:

$$\partial_t p_t + \nabla \cdot (p_t\, u_t) = 0. \tag{17}$$

This guarantees that matching $u_t$ suffices to match $p_t$, without computing $\phi_t$ or its Jacobian. The Flow Matching (FM) objective regresses $u^{\theta}_t$ against a target velocity field:

$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, x \sim p_t} \left\| u^{\theta}_t(x) - u_t(x) \right\|^2. \tag{18}$$

In general, $u_t$ and $p_t$ are intractable. CFM replaces them with tractable conditional quantities. Differentiating the conditional flow $x_t = \alpha_t x_1 + (1-\alpha_t) x_0$ yields the conditional velocity field $u_{t|1}(x_t \mid x_1) = \dot{\alpha}_t (x_1 - x_0)$. The marginal velocity field is recovered by averaging under the posterior:

$$u_t(x) = \int u_{t|1}(x \mid x_1)\, \frac{p_{t|1}(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1 = \int u_{t|1}(x \mid x_1)\, p_{1|t}(x_1 \mid x)\, dx_1. \tag{19}$$

The CFM objective has the same gradients and minimizer as $\mathcal{L}_{\text{FM}}$ but is tractable \[[47](https://arxiv.org/html/2605.11125#bib.bib47),[88](https://arxiv.org/html/2605.11125#bib.bib88)\].

### A.2 Deriving the Conditional Velocity Fields

Training both Euclidean and Riemannian flow models requires computing the conditional velocity field $u_{t|1}(\mathbf{z}_t \mid \mathbf{z}_1)$ that serves as the regression target in the CFM loss. In both cases, the derivation follows the same recipe: (1) define a conditional flow $\psi_{t|1}$, (2) differentiate with respect to $t$ to obtain $\frac{d\mathbf{z}_t}{dt}$, and (3) eliminate $\mathbf{z}_0$ so that the velocity field is expressed as a function of the current position $\mathbf{z}_t$. During training, any equivalent form of the velocity field can be used as the regression target.

#### Euclidean Case

We define the conditional flow ([3](https://arxiv.org/html/2605.11125#S2.E3)) as the linear interpolation $x_t = \alpha_t\, x_1 + (1-\alpha_t)\, x_0$. Differentiating with respect to $t$:

$$\frac{dx_t}{dt} = \dot{\alpha}_t\, (x_1 - x_0). \tag{20}$$

By simple algebra, we find $x_0 = (x_t - \alpha_t\, x_1)/(1-\alpha_t)$. Substituting:

$$\begin{aligned} u_{t|1}(x_t \mid x_1) &= \dot{\alpha}_t \left(x_1 - \frac{x_t - \alpha_t\, x_1}{1-\alpha_t}\right) = \frac{\dot{\alpha}_t}{1-\alpha_t} \left(x_1 (1-\alpha_t) - x_t + \alpha_t\, x_1\right) \\ &= \boxed{\frac{\dot{\alpha}_t}{1-\alpha_t}\, (x_1 - x_t).} \end{aligned} \tag{21}$$

For the common choice $\alpha_t = t$, this simplifies to $u_{t|1}(x_t \mid x_1) = (x_1 - x_t)/(1-t)$. During training, the equivalent form $u_{t|1}(x_t \mid x_1) = \dot{\alpha}_t\, (x_1 - x_0)$ can also be used.
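As a quick numerical sanity check of the Euclidean derivation, the sketch below evaluates the endpoint form (20) and the position form (21) at the same point and compares them. The schedule $\alpha_t = t^2$ and all function names are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

# Assumed, illustrative schedule: alpha_t = t^2 (any differentiable schedule works).
def alpha(t):
    return t ** 2

def alpha_dot(t):
    return 2.0 * t

def velocity_from_endpoints(x0, x1, t):
    """Eq. (20): velocity written with both endpoints."""
    return alpha_dot(t) * (x1 - x0)

def velocity_from_position(xt, x1, t):
    """Eq. (21): velocity written with the current position x_t."""
    return alpha_dot(t) / (1.0 - alpha(t)) * (x1 - xt)

rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=4), rng.normal(size=4)
t = 0.3
xt = alpha(t) * x1 + (1.0 - alpha(t)) * x0  # linear conditional flow

# Both forms should agree up to floating-point error.
print(np.max(np.abs(velocity_from_endpoints(x0, x1, t)
                    - velocity_from_position(xt, x1, t))))
```

The two forms agree because $x_1 - x_t = (1-\alpha_t)(x_1 - x_0)$, which is exactly the substitution used to obtain (21).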

#### Spherical Case

Let $\omega = d_{\mathbb{S}}(\mathbf{z}_0, \mathbf{z}_1) \in (0, \pi)$ be the geodesic distance between the endpoints. The conditional flow ([10](https://arxiv.org/html/2605.11125#S2.E10)), written using the SLERP formula ([8](https://arxiv.org/html/2605.11125#S2.E8)), is

$$\mathbf{z}_t = \frac{\sin((1-\alpha_t)\,\omega)}{\sin\omega}\,\mathbf{z}_0 + \frac{\sin(\alpha_t\,\omega)}{\sin\omega}\,\mathbf{z}_1. \tag{22}$$

Differentiating with respect to $t$:

$$\frac{d\mathbf{z}_t}{dt} = \frac{\dot{\alpha}_t\,\omega}{\sin\omega} \left[-\cos((1-\alpha_t)\,\omega)\,\mathbf{z}_0 + \cos(\alpha_t\,\omega)\,\mathbf{z}_1\right]. \tag{23}$$

Rewriting ([22](https://arxiv.org/html/2605.11125#A1.E22)) gives $\mathbf{z}_0 = \frac{\sin\omega}{\sin((1-\alpha_t)\,\omega)}\,\mathbf{z}_t - \frac{\sin(\alpha_t\,\omega)}{\sin((1-\alpha_t)\,\omega)}\,\mathbf{z}_1$. Substituting into ([23](https://arxiv.org/html/2605.11125#A1.E23)):

$$\frac{d\mathbf{z}_t}{dt} = \frac{\dot{\alpha}_t\,\omega}{\sin\omega} \left[-\frac{\cos((1-\alpha_t)\omega)\,\sin\omega}{\sin((1-\alpha_t)\omega)}\,\mathbf{z}_t + \frac{\cos((1-\alpha_t)\omega)\,\sin(\alpha_t\omega) + \cos(\alpha_t\omega)\,\sin((1-\alpha_t)\omega)}{\sin((1-\alpha_t)\omega)}\,\mathbf{z}_1\right]. \tag{24}$$

The numerator of the $\mathbf{z}_1$ coefficient simplifies by the following identity: $\cos((1-\alpha_t)\omega)\sin(\alpha_t\omega) + \cos(\alpha_t\omega)\sin((1-\alpha_t)\omega) = \sin\omega$. Therefore:

$$\frac{d\mathbf{z}_t}{dt} = \frac{\dot{\alpha}_t\,\omega}{\sin((1-\alpha_t)\,\omega)} \left(\mathbf{z}_1 - \cos((1-\alpha_t)\,\omega)\,\mathbf{z}_t\right). \tag{25}$$

Recalling the definition of the logarithmic map ([7](https://arxiv.org/html/2605.11125#S2.E7)) with $d_{\mathbb{S}}(\mathbf{z}_t, \mathbf{z}_1) = (1-\alpha_t)\,\omega$, we recognize that ([25](https://arxiv.org/html/2605.11125#A1.E25)) is proportional to $\log_{\mathbf{z}_t}(\mathbf{z}_1) = \frac{(1-\alpha_t)\,\omega}{\sin((1-\alpha_t)\,\omega)} \left(\mathbf{z}_1 - \cos((1-\alpha_t)\,\omega)\,\mathbf{z}_t\right)$. Their ratio is $\dot{\alpha}_t\omega / ((1-\alpha_t)\omega) = \dot{\alpha}_t/(1-\alpha_t)$, giving

$$\boxed{u_{t|1}(\mathbf{z}_t \mid \mathbf{z}_1) = \frac{\dot{\alpha}_t}{1-\alpha_t}\,\log_{\mathbf{z}_t}(\mathbf{z}_1).} \tag{26}$$

During training, the equivalent form from ([23](https://arxiv.org/html/2605.11125#A1.E23)) can also be used.
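The spherical derivation can be checked numerically in the same way. The following is a minimal sketch, assuming random unit vectors in $\mathbb{R}^8$ and hypothetical helper names (`slerp`, `log_map`, `velocity_eq23`, `velocity_eq26`). It compares the endpoint form (23) with the log-map form (26) and confirms that the velocity is tangent to the sphere at $\mathbf{z}_t$.

```python
import numpy as np

def slerp(z0, z1, a):
    """Eq. (22): geodesic interpolation between unit vectors z0 and z1."""
    omega = np.arccos(np.clip(z0 @ z1, -1.0, 1.0))
    return (np.sin((1.0 - a) * omega) * z0 + np.sin(a * omega) * z1) / np.sin(omega)

def log_map(z, y):
    """Logarithmic map on the sphere: tangent vector at z pointing toward y."""
    c = np.clip(z @ y, -1.0, 1.0)
    theta = np.arccos(c)
    return theta / np.sin(theta) * (y - c * z)

def velocity_eq23(z0, z1, a, a_dot):
    """Eq. (23): velocity expressed with both endpoints."""
    omega = np.arccos(np.clip(z0 @ z1, -1.0, 1.0))
    return a_dot * omega / np.sin(omega) * (
        -np.cos((1.0 - a) * omega) * z0 + np.cos(a * omega) * z1
    )

def velocity_eq26(zt, z1, a, a_dot):
    """Eq. (26): velocity expressed with the current position z_t."""
    return a_dot / (1.0 - a) * log_map(zt, z1)

rng = np.random.default_rng(0)
z0 = rng.normal(size=8); z0 /= np.linalg.norm(z0)
z1 = rng.normal(size=8); z1 /= np.linalg.norm(z1)
a, a_dot = 0.4, 1.0  # assumed values of alpha_t and its time derivative
zt = slerp(z0, z1, a)

v23 = velocity_eq23(z0, z1, a, a_dot)
v26 = velocity_eq26(zt, z1, a, a_dot)
print(np.max(np.abs(v23 - v26)))  # difference between the two forms
print(abs(zt @ v26))              # component along z_t (zero if tangent)
```

The helpers assume the endpoints are neither identical nor antipodal, matching the condition $\omega \in (0, \pi)$ stated above.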

## Appendix B Additional Details

### B.1 Training and Sampling Pseudocode

Algo. [1](https://arxiv.org/html/2605.11125#alg1) shows the training pseudocode for $\mathbb{S}$-FLM, Algo. [2](https://arxiv.org/html/2605.11125#alg2) the deterministic sampler, and Algo. [3](https://arxiv.org/html/2605.11125#alg3) the stochastic sampler, which replaces the posterior-weighted velocity with the log map toward a single token sampled from $p^{\theta}_{1|t}$.

**Algorithm 1** Training

**Require:** Dataset of token sequences, embedding table $\mathbf{E}$, scheduling function $\alpha_t$

1: **repeat**

2: Sample sequence $\mathbf{x} \sim p_1$, time $t \sim \mathcal{U}[0,1]$

3: **for** $\ell = 1$ **to** $L$ **do**

4: $\mathbf{z}_1^{\ell} \leftarrow \mathbf{E}[\mathbf{x}^{\ell}] / \|\mathbf{E}[\mathbf{x}^{\ell}]\|$ $\triangleright$ Embed and normalize

5: $\bm{\epsilon}^{\ell} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d);\quad \mathbf{z}_0^{\ell} \leftarrow \bm{\epsilon}^{\ell} / \|\bm{\epsilon}^{\ell}\|$ $\triangleright$ Sample noise on $\mathbb{S}^{d-1}$

6: $\mathbf{z}_t^{\ell} \leftarrow \mathrm{SLERP}(\mathbf{z}_0^{\ell}, \mathbf{z}_1^{\ell}, \alpha_t)$ $\triangleright$ Construct noisy latent

7: **end for**

8: Compute $\mathcal{L}_{\text{CE}} = -\sum_{\ell} \log p^{\theta}_{1|t}(\mathbf{x}^{\ell} \mid \mathbf{z}_t)$ $\triangleright$ Cross-entropy loss

9: Update $\theta$ and $\mathbf{E}$ by gradient descent on $\mathcal{L}_{\text{CE}}$

10: **until** converged
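The forward pass of one training step in Algo. 1 (lines 2 to 8) can be sketched in numpy as follows. This is only an illustration under stated assumptions: the schedule is taken to be $\alpha_t = t$, and the learned denoiser $p^{\theta}_{1|t}$ is replaced by a hypothetical stand-in that applies a softmax to scaled cosine similarities between $\mathbf{z}_t$ and the normalized embeddings. The gradient update of line 9 is omitted.

```python
import numpy as np

def normalize(X):
    """Project rows onto the unit sphere."""
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def slerp(Z0, Z1, a):
    """Row-wise SLERP between unit vectors (Algo. 1, line 6)."""
    cos = np.clip(np.sum(Z0 * Z1, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos)
    return (np.sin((1.0 - a) * omega) * Z0 + np.sin(a * omega) * Z1) / np.sin(omega)

def training_loss(tokens, E, t, rng, temperature=0.1):
    """Cross-entropy of a toy denoiser on noisy latents (assumes alpha_t = t)."""
    Z1 = normalize(E[tokens])                      # line 4: embed and normalize
    Z0 = normalize(rng.normal(size=Z1.shape))      # line 5: noise on the sphere
    Zt = slerp(Z0, Z1, t)                          # line 6: noisy latent
    # Hypothetical stand-in for the denoiser: softmax over scaled cosine similarity.
    logits = Zt @ normalize(E).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(tokens)), tokens].sum()   # line 8
    return loss, Zt

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 16))           # toy embedding table: 50 tokens, d = 16
tokens = rng.integers(0, 50, size=8)    # toy sequence of length L = 8
loss, Zt = training_loss(tokens, E, t=0.5, rng=rng)
print(loss, np.linalg.norm(Zt, axis=1))
```

In the actual method the denoiser is a trained network that sees the whole sequence of latents; the stand-in above treats each position independently, which is enough to illustrate how the noisy latents and the loss are assembled.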

**Algorithm 2** Sampling

**Require:** Denoiser $p^{\theta}_{1|t}$, embedding table $\mathbf{E}$ with rows $\mathbf{e}_v$ and unit-norm versions $\hat{\mathbf{e}}_v = \mathbf{e}_v / \|\mathbf{e}_v\|$, number of steps $N$, decoder $p_{\text{dec}}$

1: **for** $\ell = 1$ **to** $L$ **do**

2: $\bm{\epsilon}^{\ell} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d);\quad \mathbf{z}_0^{\ell} \leftarrow \bm{\epsilon}^{\ell} / \|\bm{\epsilon}^{\ell}\|$ $\triangleright$ Sample noise on $\mathbb{S}^{d-1}$

3: **end for**

4: **for** $n = 0$ **to** $N-1$ **do**

5: $s_n \leftarrow (\alpha_{t_{n+1}} - \alpha_{t_n}) \,/\, (1 - \alpha_{t_n})$ $\triangleright$ Step size

6: Compute $p^{\theta}_{1|t_n}(\cdot \mid \mathbf{z}_{t_n})$ $\triangleright$ One forward pass

7: **for** $\ell = 1$ **to** $L$ **do**

8: $\bar{\mathbf{v}}^{\ell} \leftarrow \sum_{v \in \mathcal{V}} p^{\theta}_{1|t_n}(v \mid \mathbf{z}_{t_n})\, \log_{\mathbf{z}_{t_n}^{\ell}}\!\left(\hat{\mathbf{e}}_v\right)$ $\triangleright$ Posterior-weighted velocity

9: $\mathbf{z}_{t_{n+1}}^{\ell} \leftarrow \exp_{\mathbf{z}_{t_n}^{\ell}}\!\left(s_n \cdot \bar{\mathbf{v}}^{\ell}\right)$ $\triangleright$ Geodesic Euler step

10: **end for**

11: **end for**

12: $\mathbf{x} \sim p_{\text{dec}}(\cdot \mid \mathbf{z}_1)$ $\triangleright$ Decode (argmax or AR)

13: **return** $\mathbf{x}$

**Algorithm 3** Stochastic sampling

**Require:** Denoiser $p^{\theta}_{1|t}$, embedding table $\mathbf{E}$ with rows $\mathbf{e}_v$ and unit-norm versions $\hat{\mathbf{e}}_v = \mathbf{e}_v / \|\mathbf{e}_v\|$, number of steps $N$, decoder $p_{\text{dec}}$

1: **for** $\ell = 1$ **to** $L$ **do**

2: $\bm{\epsilon}^{\ell} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d);\quad \mathbf{z}_0^{\ell} \leftarrow \bm{\epsilon}^{\ell} / \|\bm{\epsilon}^{\ell}\|$ $\triangleright$ Sample noise on $\mathbb{S}^{d-1}$

3: **end for**

4: **for** $n = 0$ **to** $N-1$ **do**

5: $s_n \leftarrow (\alpha_{t_{n+1}} - \alpha_{t_n}) \,/\, (1 - \alpha_{t_n})$ $\triangleright$ Step size

6: Compute $p^{\theta}_{1|t_n}(\cdot \mid \mathbf{z}_{t_n})$ $\triangleright$ One forward pass

7: **for** $\ell = 1$ **to** $L$ **do**

8: Sample $\hat{\mathbf{x}}^{\ell} \sim p^{\theta}_{1|t_n}(\cdot \mid \mathbf{z}_{t_n})$ $\triangleright$ Posterior sample

9: $\bar{\mathbf{v}}^{\ell} \leftarrow \log_{\mathbf{z}_{t_n}^{\ell}}\!\left(\hat{\mathbf{e}}_{\hat{\mathbf{x}}^{\ell}}\right)$ $\triangleright$ Velocity toward sampled token

10: $\mathbf{z}_{t_{n+1}}^{\ell} \leftarrow \exp_{\mathbf{z}_{t_n}^{\ell}}\!\left(s_n \cdot \bar{\mathbf{v}}^{\ell}\right)$ $\triangleright$ Geodesic Euler step

11: **end for**

12: **end for**

13: $\mathbf{x} \sim p_{\text{dec}}(\cdot \mid \mathbf{z}_1)$ $\triangleright$ Decode (argmax or AR)

14: **return** $\mathbf{x}$
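Algos. 2 and 3 differ only in how the velocity $\bar{\mathbf{v}}^{\ell}$ is formed, so both can be sketched with one function and a flag. The code below is a toy illustration, not the paper's implementation. It assumes $\alpha_t = t$ on a uniform time grid, uses a hypothetical stand-in denoiser (softmax over scaled cosine similarity), and decodes by nearest embedding in place of $p_{\text{dec}}$.

```python
import numpy as np

def toy_denoiser(Z, E_hat, temperature=0.1):
    """Hypothetical stand-in for p_theta: softmax over scaled cosine similarity."""
    logits = Z @ E_hat.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def log_map_all(Z, E_hat):
    """Log map from every latent (L, d) to every embedding (V, d): shape (L, V, d)."""
    cos = np.clip(Z @ E_hat.T, -1.0, 1.0)
    theta = np.arccos(cos)
    sin = np.sin(theta)
    # theta / sin(theta) tends to 1 as theta -> 0; antipodal pairs are not handled.
    coef = np.where(sin > 1e-8, theta / np.maximum(sin, 1e-8), 1.0)
    return coef[..., None] * (E_hat[None, :, :] - cos[..., None] * Z[:, None, :])

def exp_map(Z, V):
    """Exponential map on the sphere, applied row-wise."""
    norm = np.linalg.norm(V, axis=-1, keepdims=True)
    direction = V / np.maximum(norm, 1e-12)
    out = np.cos(norm) * Z + np.sin(norm) * direction
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

def sample(E, L, N, rng, stochastic=False):
    E_hat = E / np.linalg.norm(E, axis=1, keepdims=True)
    Z = rng.normal(size=(L, E.shape[1]))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)       # noise on the sphere
    alphas = np.linspace(0.0, 1.0, N + 1)               # assumed alpha_t = t
    for n in range(N):
        s = (alphas[n + 1] - alphas[n]) / (1.0 - alphas[n])   # step size
        probs = toy_denoiser(Z, E_hat)                        # one forward pass
        logs = log_map_all(Z, E_hat)
        if stochastic:   # Algo. 3: log map toward one sampled token per position
            idx = np.array([rng.choice(len(E_hat), p=p) for p in probs])
            v_bar = logs[np.arange(L), idx]
        else:            # Algo. 2: posterior-weighted velocity
            v_bar = np.einsum("lv,lvd->ld", probs, logs)
        Z = exp_map(Z, s * v_bar)                             # geodesic Euler step
    tokens = np.argmax(Z @ E_hat.T, axis=1)   # nearest-embedding decode (assumption)
    return tokens, Z

E = np.random.default_rng(0).normal(size=(20, 8))
print(sample(E, L=5, N=16, rng=np.random.default_rng(1))[0])
print(sample(E, L=5, N=16, rng=np.random.default_rng(1), stochastic=True)[0])
```

With this schedule the last step has $s_{N-1} = 1$, so the stochastic variant lands exactly on the embedding of the token sampled at the final step, while the deterministic variant ends at a posterior-weighted point that still has to be decoded.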

### B.2 Adaptive Noise Schedule

#### Motivation

The denoising task is not equally informative at every noise level. At low noise levels, the task is trivial since the noisy embeddings are very close to the clean embeddings ([16](https://arxiv.org/html/2605.11125#S3.E16)). At high noise levels, the latent $\mathbf{z}_t^{\ell}$ is close to pure noise and hence carries little signal about the clean token. In both cases the denoiser $p^{\theta}_{1|t}$ has little to learn. The informative region lies in between, where the model can extract signal from noise and the loss drops fastest as a function of $t$. Inspired by InfoNoise \[[73](https://arxiv.org/html/2605.11125#bib.bib73)\], we focus training on regions where the loss derivative $|d\mathcal{L}/dt|$ is largest.

**Algorithm 4** Adaptive noise schedule refit

**Require:** buffer $\{(t_i, \mathcal{L}_i)\}$, base schedule $\alpha^{\text{base}}$, grid $\{t_j\}_{j=1}^{N} \subset [0,1]$, EMA rate $\beta$, uniform mix $\mu$, ridge $\lambda$, refit count $n$

1: Fit a spline $\hat{\mathcal{L}}(t)$ to $(t_i, \mathcal{L}_i)$ via ridge regression ($\lambda$)

2: $g(t_j) \leftarrow \max\{d\hat{\mathcal{L}}/dt\,(t_j),\, 0\}$ $\triangleright$ Clamp: loss should grow with noise

3: $w(t_j) \leftarrow (1-\mu)\, g(t_j) + \mu$ $\triangleright$ Ensure the CDF is invertible

4: $F(t_j) \leftarrow \sum_{j' \leq j} w(t_{j'}) \big/ \sum_{j'} w(t_{j'})$ $\triangleright$ Empirical CDF (discrete points)

5: $\tilde{t}_j \leftarrow F^{-1}(t_j)$ via PCHIP interpolation $\triangleright$ Invert CDF

6: $\bar{\alpha}_j \leftarrow \beta\, \bar{\alpha}_j + (1-\beta)\, \alpha^{\text{base}}(\tilde{t}_j)$ $\triangleright$ EMA update

7: $\alpha^{\text{final}}_j \leftarrow \bar{\alpha}_j / (1 - \beta^{n})$ $\triangleright$ Bias correction

8: Store $\alpha^{\text{adapt}}_t$ as a PCHIP spline through $\{(t_j, \alpha^{\text{final}}_j)\}$

During training we append $(t, \mathcal{L})$ pairs to the buffer at every step and invoke Algo. [4](https://arxiv.org/html/2605.11125#alg4) every $R$ steps after a warmup of $W = 1000$ steps. The EMA is essential to reduce oscillations in the schedule, and the Adam-style bias correction $1/(1-\beta^{n})$ \[[43](https://arxiv.org/html/2605.11125#bib.bib43)\] prevents the early-training estimate from being biased toward zero. By default we use $R = 50$, buffer size $R \cdot \text{batch size}$, $\beta = 0.9$, and $\mu = 10^{-3}$. All operations run on CPU in `numpy`, so the impact on training throughput is minimal. Empirically, the noise schedule stabilizes quickly.

#### Implementation

We fit the loss profile using a spline (`sklearn.preprocessing.SplineTransformer`) with ridge regression (`sklearn.linear_model.Ridge`). For interpolating the inverse CDF $F^{-1}$ and the final noise schedule $\alpha^{\text{adapt}}_t$, we use `scipy.interpolate.PchipInterpolator`, which builds a *Piecewise Cubic Hermite Interpolating Polynomial* \[[27](https://arxiv.org/html/2605.11125#bib.bib27)\] that preserves the monotonicity of the data.

#### Relation to InfoNoise

Our adaptive schedule is inspired by InfoNoise \[[73](https://arxiv.org/html/2605.11125#bib.bib73)\], which also adapts the noise schedule online from the value of the loss. We differ in several ways.

- Importance signal. InfoNoise estimates the conditional entropy rate $\dot{H}[\mathbf{x}_0 \mid \mathbf{x}_\sigma] \propto \mathrm{mmse}(\sigma)/\sigma^3$ via the I–MMSE identity. We use $|d\hat{\mathcal{L}}/dt|$ directly on the cross-entropy profile.
- Estimator. InfoNoise splits the $\sigma$-range into bins, each with a FIFO buffer for recent losses and an EMA on the MSE estimate. We fit a spline to $(t, \mathcal{L})$ pairs via ridge regression. This removes the need to choose a number of bins.
- Stabilization. InfoNoise uses pivot calibration and gating to suppress boundary artifacts. We enforce a minimum CDF increment $\mu$ to keep the CDF invertible and stabilize the schedule with an EMA, in addition to truncating the noise schedule range ([16](https://arxiv.org/html/2605.11125#S3.E16)).
- Loss weight. InfoNoise separates sampling density $\pi(\sigma)$ from loss weight $w(\sigma)$ via the effective weight $\phi = \pi \cdot w$. We use unweighted cross-entropy ($w \equiv 1$), so the sampling density matches the effective weight directly.

### B.3 Hyperspherical Backbone Architecture

In this section, we present the hyperspherical denoising backbone in more detail. The architecture is adapted from nGPT \[[50](https://arxiv.org/html/2605.11125#bib.bib50)\] for $\mathbb{S}$-FLM. The main differences are that (1) we use bidirectional attention, (2) we use GELU activations in the MLP, since the other AR / diffusion denoisers also use GELU, and (3) we make the residual gates time-conditional. As in nGPT, every weight matrix has unit-norm vectors along the embedding-dimension axis. We enforce this at initialization and optionally re-apply the projection after every optimizer step.

#### Notation

Let $\mathrm{Norm}(\mathbf{u}) = \mathbf{u} / \max(\|\mathbf{u}\|_2, \varepsilon)$, with $\varepsilon = 10^{-6}$, project a vector onto the unit sphere along its last axis. We write $d$ for the embedding dimension, $H$ for the number of attention heads, $d_k = d/H$, and $N$ for the number of transformer blocks. The base scale $b = 1/\sqrt{d}$ appears in several rescaling parameters. All per-dimension scales ($\mathbf{s}_{qk}, \mathbf{s}_{fc}, \mathbf{s}_z$) and residual gates ($\bm{\gamma}_A, \bm{\gamma}_M$) introduced below are learnable.

#### Normalized attention

The multi-head attention uses bias-free linear layers: $\mathbf{Q} = \mathbf{h}\mathbf{W}_Q$, $\mathbf{K} = \mathbf{h}\mathbf{W}_K$, $\mathbf{V} = \mathbf{h}\mathbf{W}_V$, split into $H$ heads. RoPE \[[87](https://arxiv.org/html/2605.11125#bib.bib87)\] is applied to $\mathbf{Q}, \mathbf{K}$. We then apply the normalization and rescaling:

$$\tilde{\mathbf{s}}_{qk} = \mathbf{s}_{qk} \cdot s_{qk}^{\mathrm{init}} / s_{qk}^{\mathrm{scale}}, \qquad \mathbf{Q}' = \tilde{\mathbf{s}}_{qk} \odot \mathrm{Norm}(\mathbf{Q}), \quad \mathbf{K}' = \tilde{\mathbf{s}}_{qk} \odot \mathrm{Norm}(\mathbf{K}),$$

with $s_{qk}^{\mathrm{init}} = 1$ and $s_{qk}^{\mathrm{scale}} = b$. The softmax scaling follows from the variance of the inner product $\langle \hat{\mathbf{q}}, \hat{\mathbf{k}} \rangle$ between two unit-norm vectors $\hat{\mathbf{q}} = \mathrm{Norm}(\mathbf{Q})$ and $\hat{\mathbf{k}} = \mathrm{Norm}(\mathbf{K})$. Assume the components of $\mathbf{Q}, \mathbf{K}$ are i.i.d. zero-mean Gaussian. After projection by $\mathrm{Norm}$, $\hat{\mathbf{q}}$ and $\hat{\mathbf{k}}$ are independent samples from $\mathrm{Unif}(\mathbb{S}^{d_k-1})$. By rotational symmetry and $\|\mathbf{u}\|^2 = 1$, any $\mathbf{u} \sim \mathrm{Unif}(\mathbb{S}^{d_k-1})$ satisfies $\mathbb{E}[u_i] = 0$ and $\mathbb{E}[u_i u_j] = \delta_{ij}/d_k$, hence $\mathbb{E}[\langle \hat{\mathbf{q}}, \hat{\mathbf{k}} \rangle] = 0$ and $\mathrm{Var}[\langle \hat{\mathbf{q}}, \hat{\mathbf{k}} \rangle] = 1/d_k$. Following nGPT, we therefore set the softmax scale to $\sqrt{d_k}$:

$$\mathbf{O} = \mathrm{softmax}\!\bigl(\sqrt{d_k}\,\mathbf{Q}'\mathbf{K}'^{\top}\bigr)\mathbf{V}, \qquad \mathbf{h}_A = \mathbf{O}\,\mathbf{W}_O.$$
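The variance claim behind the $\sqrt{d_k}$ scale is easy to check numerically. The snippet below is an illustrative Monte Carlo check, assuming i.i.d. Gaussian queries and keys as in the argument above; the sample count, seed, and variable names are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 64, 20_000

def norm(u, eps=1e-6):
    """Project onto the unit sphere along the last axis (Norm in the text)."""
    return u / np.maximum(np.linalg.norm(u, axis=-1, keepdims=True), eps)

# Independent Gaussian vectors, projected onto the sphere S^{d_k - 1}.
q_hat = norm(rng.standard_normal((n_samples, d_k)))
k_hat = norm(rng.standard_normal((n_samples, d_k)))

dots = np.sum(q_hat * k_hat, axis=-1)       # <q_hat, k_hat> per sample
mean_dot = dots.mean()                      # expected to be near 0
var_times_dk = dots.var() * d_k             # expected to be near 1
```

Multiplying the inner product by $\sqrt{d_k}$ therefore gives logits with roughly unit variance before the learnable scale $\tilde{\mathbf{s}}_{qk}$ is applied.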

#### Normalized MLP

The MLP is similar to the MLP in discrete-diffusion DiTs \[[66](https://arxiv.org/html/2605.11125#bib.bib66),[52](https://arxiv.org/html/2605.11125#bib.bib52)\], with the addition of a per-dimension scale $\mathbf{s}_{fc}$ on the hidden activations:

$$\mathbf{h}_M = \mathrm{GELU}\bigl(\sqrt{d}\,\mathbf{s}_{fc} \odot (\mathbf{h}\,\mathbf{W}_{fc})\bigr)\,\mathbf{W}_O^{M}.$$

As in the previous section, the $\sqrt{d}$ factor restores unit variance to the inputs of GELU.

#### Timestep injection

Let $\bm{\tau}(t) = \mathrm{SiLU}(\mathrm{TimestepEmbedder}(\sigma(t))) \in \mathbb{R}^{d_{\mathrm{cond}}}$ be the timestep embedding (sinusoidal embedding followed by a two-layer MLP). Each transformer block has its own zero-initialized linear map $\mathbf{W}_{\bm{\delta}} \in \mathbb{R}^{2d \times d_{\mathrm{cond}}}$ that computes per-dimension biases $(\bm{\delta}_A(t), \bm{\delta}_M(t)) = \mathbf{W}_{\bm{\delta}}\bm{\tau}(t)$ for the residual update (next paragraph).

#### Residual update

The block updates the hidden state by interpolating with the (normalized) attention output, and then with the (normalized) MLP output. Each interpolation is gated by learnable per-dimension parameters $\bm{\gamma}_A, \bm{\gamma}_M \in \mathbb{R}^d$, and renormalized:

$$\begin{aligned}
\mathbf{h} &\leftarrow \mathrm{Norm}\Bigl(\mathrm{Norm}(\mathbf{h}) + \tilde{\bm{\gamma}}_A \odot \bigl(\mathrm{Norm}(\mathbf{h}_A) - \mathrm{Norm}(\mathbf{h})\bigr)\Bigr), \\
\mathbf{h} &\leftarrow \mathrm{Norm}\Bigl(\mathrm{Norm}(\mathbf{h}) + \tilde{\bm{\gamma}}_M \odot \bigl(\mathrm{Norm}(\mathbf{h}_M) - \mathrm{Norm}(\mathbf{h})\bigr)\Bigr).
\end{aligned}$$

The effective gate is $\tilde{\bm{\gamma}}_X = |\bm{\gamma}_X \cdot (\gamma^{\mathrm{init}}/\gamma^{\mathrm{scale}}) + \bm{\delta}_X(t)|$ with $\gamma^{\mathrm{init}} = 0.05$ and $\gamma^{\mathrm{scale}} = b$, for $X \in \{A, M\}$, where $\bm{\delta}_X(t)$ is the time modulation defined above. Except for the time modulation $\bm{\delta}_X(t)$, the design and hyperparameters are carried over from nGPT. The absolute value ensures the gate is non-negative, so the update always pulls $\mathbf{h}$ toward the block output.
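A single gated residual update can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: `gated_update` is a hypothetical helper, and we assume (following our reading of the nGPT convention) that the stored parameter $\bm{\gamma}_X$ starts at $\gamma^{\mathrm{scale}} = b$ so that the effective gate starts at $\gamma^{\mathrm{init}} = 0.05$, with $\bm{\delta}_X(t) = 0$ at initialization because $\mathbf{W}_{\bm{\delta}}$ is zero-initialized.

```python
import numpy as np

def norm(u, eps=1e-6):
    """Project onto the unit sphere along the last axis."""
    return u / np.maximum(np.linalg.norm(u, axis=-1, keepdims=True), eps)

def gated_update(h, h_block, gamma, delta, gamma_init=0.05):
    """h <- Norm(Norm(h) + gate * (Norm(h_block) - Norm(h)))."""
    d = h.shape[-1]
    b = 1.0 / np.sqrt(d)                             # base scale, gamma_scale = b
    gate = np.abs(gamma * (gamma_init / b) + delta)  # non-negative effective gate
    h_n = norm(h)
    return norm(h_n + gate * (norm(h_block) - h_n))

rng = np.random.default_rng(0)
d = 16
h = rng.standard_normal((4, d))
h_A = rng.standard_normal((4, d))                    # stand-in for an attention output
b = 1.0 / np.sqrt(d)

# ASSUMPTION: stored gamma starts at b, so the effective gate is 0.05.
out = gated_update(h, h_A, gamma=np.full(d, b), delta=np.zeros(d))
# Gate 0 leaves Norm(h) unchanged; gate 1 jumps to Norm(h_A).
out_closed = gated_update(h, h_A, gamma=np.zeros(d), delta=np.zeros(d))
out_open = gated_update(h, h_A, gamma=np.full(d, b / 0.05), delta=np.zeros(d))
```

With a uniform gate in $(0, 1]$ the update moves $\mathbf{h}$ along the great circle toward the block output, so the cosine similarity to $\mathrm{Norm}(\mathbf{h}_A)$ cannot decrease.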

#### Output head

We project the output of the last block through a row-normalized linear map $\mathbf{W}_{\mathrm{lm}} \in \mathbb{R}^{|\mathcal{V}| \times d}$ and rescale by a learnable scale $\mathbf{s}_z \in \mathbb{R}^{|\mathcal{V}|}$:

$$\tilde{\mathbf{s}}_z = \mathbf{s}_z \cdot s_z^{\mathrm{init}} / s_z^{\mathrm{scale}}, \qquad \bm{\ell} = \tilde{\mathbf{s}}_z \odot (\mathbf{h}\,\mathbf{W}_{\mathrm{lm}}^{\top}),$$

with $s_z^{\mathrm{init}} = 1$ and $s_z^{\mathrm{scale}} = b$. The rescaling is necessary because $\mathbf{h}\mathbf{W}_{\mathrm{lm}}^{\top}$ contains inner products of unit vectors and is therefore bounded in $[-1, 1]^{|\mathcal{V}|}$. Without $\mathbf{s}_z$ the cross-entropy would saturate near a uniform distribution.
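To make the saturation concrete, here is a short worked bound. It is our own back-of-the-envelope arithmetic rather than a figure from the paper, and the vocabulary size $|\mathcal{V}| = 50{,}000$ is only an illustrative choice. If every logit lies in $[-1, 1]$, the most probability mass the softmax can give to a single token is attained when that token has logit $1$ and all others have logit $-1$:

```latex
p_v \;\le\; \frac{e^{1}}{e^{1} + (|\mathcal{V}|-1)\,e^{-1}}
    \;=\; \frac{e^{2}}{e^{2} + |\mathcal{V}| - 1}
    \;\approx\; 1.5 \times 10^{-4}
    \quad \text{for } |\mathcal{V}| = 50{,}000,
\qquad
-\log p_v \;\gtrsim\; 8.8 \text{ nats}
    \quad \text{vs.} \quad \log |\mathcal{V}| \approx 10.8 \text{ nats}.
```

Under these assumptions the per-token cross-entropy could improve on the uniform distribution by at most about two nats, which is why a learnable scale on the logits is needed.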

### B.4 Step Size with the Euler Sampler

The marginal velocity from ([15](https://arxiv.org/html/2605.11125#S3.E15)) is $u_t^{\ell} = \frac{\dot{\alpha}_t}{1-\alpha_t}\,\bar{u}_t^{\ell}$, where $\bar{u}_t^{\ell} = \sum_v p_{1|t}(\mathbf{x}^{\ell} = v \mid \mathbf{z}_t)\,\log_{\mathbf{z}_t^{\ell}}(\hat{\mathbf{e}}_v)$. During sampling, to take an Euler step of size $\Delta t$ from time $t_n$ to $t_{n+1}$, we compute

$$\mathbf{z}_{t_{n+1}}^{\ell} = \exp_{\mathbf{z}_{t_n}^{\ell}}\!\left(\Delta t \cdot u_{t_n}^{\ell}\right) = \exp_{\mathbf{z}_{t_n}^{\ell}}\!\left(\frac{\dot{\alpha}_{t_n}\,\Delta t}{1-\alpha_{t_n}} \cdot \bar{u}_{t_n}^{\ell}\right). \tag{27}$$

Using the first-order approximation $\dot{\alpha}_{t_n}\,\Delta t \approx \alpha_{t_{n+1}} - \alpha_{t_n}$, the step size applied to $\bar{u}_t^{\ell}$ is

$$s_n = \frac{\alpha_{t_{n+1}} - \alpha_{t_n}}{1-\alpha_{t_n}}, \tag{28}$$

so that $\mathbf{z}_{t_{n+1}}^{\ell} = \exp_{\mathbf{z}_{t_n}^{\ell}}(s_n \cdot \bar{u}_{t_n}^{\ell})$. For the linear schedule $\alpha_t = t$ with uniform time steps $t_n = n/N$, this gives $s_n = 1/(N-n)$.
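The step sizes in (28) and the exponential map used in (27) can be sketched as below. This is an illustrative NumPy snippet with hypothetical helper names (`step_sizes`, `sphere_exp`); the exponential map is the standard closed form on the unit sphere, $\exp_{\mathbf{z}}(\mathbf{v}) = \cos(\|\mathbf{v}\|)\,\mathbf{z} + \sin(\|\mathbf{v}\|)\,\mathbf{v}/\|\mathbf{v}\|$ for a tangent vector $\mathbf{v}$, which we assume matches the paper's definition.

```python
import numpy as np

def step_sizes(alpha, N):
    """s_n = (alpha(t_{n+1}) - alpha(t_n)) / (1 - alpha(t_n)) on the grid t_n = n/N."""
    t = np.arange(N + 1) / N
    a = alpha(t)
    return (a[1:] - a[:-1]) / (1.0 - a[:-1])

def sphere_exp(z, v, eps=1e-12):
    """Exponential map on the unit sphere at z, applied to a tangent vector v."""
    theta = np.linalg.norm(v)
    if theta < eps:
        return z
    return np.cos(theta) * z + np.sin(theta) * (v / theta)

N = 8
s_linear = step_sizes(lambda t: t, N)       # linear schedule alpha_t = t

# Moving a quarter turn from e_1 along e_2 lands on e_2.
z = np.array([1.0, 0.0, 0.0])
v = (np.pi / 2) * np.array([0.0, 1.0, 0.0])
z_next = sphere_exp(z, v)
```

For the linear schedule the last step has $s_{N-1} = 1$, so the final Euler step applies $\bar{u}_{t_n}^{\ell}$ with unit weight.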

### B.5 Re-Normalizing Embeddings

We show that when embeddings are normalized during the forward pass, the norm of the embeddings grows at every optimization step, which decays the effective learning rate (since the angular updates become smaller). Our argument is similar to prior work \[[41](https://arxiv.org/html/2605.11125#bib.bib41)\].

#### Setup

Let $\mathbf{e} \in \mathbb{R}^d$ be an *unconstrained* embedding vector, and let $\hat{\mathbf{e}} = \mathbf{e}/\|\mathbf{e}\|$ be the normalized version used in the forward pass. The loss $\mathcal{L}$ depends on $\mathbf{e}$ only through $\hat{\mathbf{e}}$.

#### Jacobian of the normalization

The Jacobian of $\hat{\mathbf{e}}$ with respect to $\mathbf{e}$ is

$$\frac{\partial \hat{\mathbf{e}}}{\partial \mathbf{e}} = \frac{1}{\|\mathbf{e}\|}\left(\mathbf{I} - \hat{\mathbf{e}}\,\hat{\mathbf{e}}^{\top}\right). \tag{29}$$

By the chain rule, the gradient of $\mathcal{L}$ with respect to the unnormalized embedding is

$$\nabla_{\mathbf{e}}\mathcal{L} = \left(\frac{\partial \hat{\mathbf{e}}}{\partial \mathbf{e}}\right)^{\top} \nabla_{\hat{\mathbf{e}}}\mathcal{L} = \frac{1}{\|\mathbf{e}\|}\left(\mathbf{I} - \hat{\mathbf{e}}\,\hat{\mathbf{e}}^{\top}\right)\nabla_{\hat{\mathbf{e}}}\mathcal{L}, \tag{30}$$

where we used the fact that $\mathbf{I} - \hat{\mathbf{e}}\,\hat{\mathbf{e}}^{\top}$ is symmetric.

#### The gradient is orthogonal to $\mathbf{e}$

The matrix $\mathbf{P} = \mathbf{I} - \hat{\mathbf{e}}\,\hat{\mathbf{e}}^{\top}$ projects onto the subspace orthogonal to $\hat{\mathbf{e}}$. Since $\mathbf{e} = \|\mathbf{e}\|\,\hat{\mathbf{e}}$:

$$\begin{aligned}
\mathbf{e}^{\top}\nabla_{\mathbf{e}}\mathcal{L}
&= \frac{1}{\|\mathbf{e}\|}\,\mathbf{e}^{\top}\!\left(\mathbf{I} - \hat{\mathbf{e}}\,\hat{\mathbf{e}}^{\top}\right)\nabla_{\hat{\mathbf{e}}}\mathcal{L} \\
&= \frac{1}{\|\mathbf{e}\|}\left(\mathbf{e}^{\top} - \underbrace{\mathbf{e}^{\top}\hat{\mathbf{e}}}_{=\,\|\mathbf{e}\|}\,\hat{\mathbf{e}}^{\top}\right)\nabla_{\hat{\mathbf{e}}}\mathcal{L} \\
&= \frac{1}{\|\mathbf{e}\|}\left(\mathbf{e}^{\top} - \|\mathbf{e}\|\,\hat{\mathbf{e}}^{\top}\right)\nabla_{\hat{\mathbf{e}}}\mathcal{L} \\
&= \frac{1}{\|\mathbf{e}\|}\left(\mathbf{e}^{\top} - \mathbf{e}^{\top}\right)\nabla_{\hat{\mathbf{e}}}\mathcal{L} = 0,
\end{aligned} \tag{31}$$

where we used $\|\mathbf{e}\|\,\hat{\mathbf{e}} = \mathbf{e}$.

#### Norm growth

Since $\nabla_{\mathbf{e}}\mathcal{L} \perp \mathbf{e}$, a gradient step $\mathbf{e}' = \mathbf{e} - \eta\,\nabla_{\mathbf{e}}\mathcal{L}$ gives, by the Pythagorean theorem:

$$\|\mathbf{e}'\|^2 = \|\mathbf{e}\|^2 + \eta^2\,\|\nabla_{\mathbf{e}}\mathcal{L}\|^2 > \|\mathbf{e}\|^2, \tag{32}$$

where the inequality is strict whenever the gradient is non-zero. The norm therefore grows at every step, which induces an effective learning rate decay on the embedding table.

#### Fix

After each optimization step, we can re-project $\mathbf{e} \leftarrow \mathbf{e}/\|\mathbf{e}\|$ (not through the computational graph, but by directly modifying the parameters). This ensures that $\|\mathbf{e}\| = 1$ and that there is no implicit learning rate annealing \[[41](https://arxiv.org/html/2605.11125#bib.bib41)\]. Empirically, we find that the learning rate annealing on the embedding table is beneficial.
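The orthogonality in (31), the norm growth in (32), and the re-projection can all be checked numerically. The sketch below uses a toy loss $\mathcal{L}(\hat{\mathbf{e}}) = -\langle \hat{\mathbf{e}}, \mathbf{target} \rangle$ and a plain gradient step; the loss, the `target` vector, and the learning rate are illustrative assumptions of ours. Adaptive optimizers rescale the gradient per coordinate, so the exact Pythagorean identity is only guaranteed for the plain step shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 16, 0.1
e = rng.standard_normal(d)                  # unconstrained embedding
target = rng.standard_normal(d)             # hypothetical direction for the toy loss

def grad_wrt_e(e):
    """Gradient of L(e_hat) = -<e_hat, target> with respect to the unnormalized e."""
    norm_e = np.linalg.norm(e)
    e_hat = e / norm_e
    grad_e_hat = -target                            # dL / d e_hat
    P = np.eye(d) - np.outer(e_hat, e_hat)          # projector orthogonal to e_hat
    return P @ grad_e_hat / norm_e                  # Eq. (30)

g = grad_wrt_e(e)
e_new = e - eta * g                                 # plain gradient step
e_reproj = e_new / np.linalg.norm(e_new)            # re-projection onto the sphere
```

Skipping the last line lets $\|\mathbf{e}\|$ drift upward, which shrinks the $1/\|\mathbf{e}\|$ factor in (30) and hence the angular step at a fixed learning rate.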

### B.6 Training Cost

Training on Sudokus is fast and cheap. We use a single L40S GPU and train for less than 2 hours. Using the on-demand price on RunPod ([https://www.runpod.io](https://www.runpod.io/)), training costs less than $2.

Table [3](https://arxiv.org/html/2605.11125#A2.T3) shows the steps per second and total cost on TinyGSM and OpenWebText. We train on H100 GPUs with bfloat16 mixed precision. We use 8 H100s for TinyGSM and 16 H100s for OpenWebText. We train for 250k steps on TinyGSM (Sec. [4.2](https://arxiv.org/html/2605.11125#S4.SS2)) and 1M steps on OpenWebText (Sec. [4.3](https://arxiv.org/html/2605.11125#S4.SS3)).

Table 3: Training cost on TinyGSM and OpenWebText. The total duration and GPU-hours are derived from the steps/sec. We did not investigate deeply why $\mathbb{S}$-FLM trains faster than MDLM. We did not try to hyper-optimize our implementations. For MDLM, CANDI and $\mathbb{S}$-FLM, we import their original implementation, based on the Duo codebase ([https://github.com/s-sahoo/duo](https://github.com/s-sahoo/duo)), since we also use it. It is plausible that MDLM / Duo can be made to train as fast as $\mathbb{S}$-FLM. Observe that FLM is slower than $\mathbb{S}$-FLM. This is expected since FLM materializes dense one-hot-diffused arrays and multiplies them with the embedding table, which is costly.

## Appendix C Additional Results

### C.1 Sequence Length Using Different Tokenizers

![Refer to caption](https://arxiv.org/html/2605.11125v1/x8.png)

Figure 5: Distribution of tokenized sequence lengths on TinyGSM, under the GPT-2 tokenizer (left) and the `SmolLM-135M` tokenizer (right). The SmolLM tokenizer yields shorter sequences on average (median 204 tokens for GPT-2 vs. 180 for SmolLM), which allows us to fit more problems in a fixed context length of 512 tokens.

In Figure [5](https://arxiv.org/html/2605.11125#A3.F5), we plot the histogram of the lengths of the tokenized examples in TinyGSM \[[48](https://arxiv.org/html/2605.11125#bib.bib48)\], when tokenized with the GPT-2 tokenizer \[[71](https://arxiv.org/html/2605.11125#bib.bib71)\] or the SmolLM tokenizer \[[2](https://arxiv.org/html/2605.11125#bib.bib2)\]. Since the SmolLM tokenizer was trained on code, unlike GPT-2, we see that it offers better compression. Therefore, for the experiments on TinyGSM, we use the SmolLM tokenizer.

### C.2 Analysis under the Random Codebook Model

To derive ([16](https://arxiv.org/html/2605.11125#S3.E16)), we use the standard concentration bound for inner products of uniform random vectors on the sphere \[[91](https://arxiv.org/html/2605.11125#bib.bib91), Theorem 3.4.5\]. For $\mathbf{X}, \mathbf{Y} \in \mathbb{S}^{d-1}$ with at least one of them sampled from $\mathcal{U}(\mathbb{S}^{d-1})$ and the other either fixed or independently sampled, for every $\epsilon > 0$:

$$\mathbb{P}\!\left(\mathbf{X}^{\top}\mathbf{Y} > \epsilon\right) \leq 2\exp\!\left(-\frac{d\epsilon^2}{2}\right). \tag{33}$$

The probability decreases exponentially in the dimension $d$.

#### Closed\-form for the sampling dynamics

Let $\mathbf{z}_\alpha = \mathrm{SLERP}(\mathbf{z}_0, \hat{\mathbf{e}}_k, \alpha)$ for $\alpha \in [0, 1]$. Then, by definition of SLERP, we have

$$\langle \mathbf{z}_\alpha, \hat{\mathbf{e}}_k \rangle = \cos\!\bigl(\omega(1-\alpha)\bigr), \tag{34}$$

where $\omega$ is the angle between $\mathbf{z}_0$ and $\hat{\mathbf{e}}_k$. Indeed, we recall the standard trigonometric identities

$$\sin(a-b) = \sin(a)\cos(b) - \cos(a)\sin(b) \tag{35}$$

and

$$\cos(a-b) = \cos(a)\cos(b) + \sin(a)\sin(b). \tag{36}$$

By the definition of SLERP in ([8](https://arxiv.org/html/2605.11125#S2.E8)),

$$\mathbf{z}_\alpha = \frac{\sin((1-\alpha)\omega)}{\sin\omega}\,\mathbf{z}_0 + \frac{\sin(\alpha\omega)}{\sin\omega}\,\hat{\mathbf{e}}_k. \tag{37}$$

Therefore, using $\langle \mathbf{z}_0, \hat{\mathbf{e}}_k \rangle = \cos\omega$ and $\langle \hat{\mathbf{e}}_k, \hat{\mathbf{e}}_k \rangle = 1$,

$$\begin{aligned}
\langle \mathbf{z}_\alpha, \hat{\mathbf{e}}_k \rangle
&= \frac{\sin((1-\alpha)\omega)}{\sin\omega}\,\langle \mathbf{z}_0, \hat{\mathbf{e}}_k \rangle + \frac{\sin(\alpha\omega)}{\sin\omega} \\
&= \frac{1}{\sin\omega}\Big[\sin((1-\alpha)\omega)\cos\omega + \sin(\alpha\omega)\Big] \\
&= \frac{1}{\sin\omega}\Big[\big(\sin\omega\cos(\alpha\omega) - \cos\omega\sin(\alpha\omega)\big)\cos\omega + \sin(\alpha\omega)\Big] \\
&= \frac{1}{\sin\omega}\Big[\sin\omega\cos\omega\cos(\alpha\omega) - \cos^2\omega\sin(\alpha\omega) + \sin(\alpha\omega)\Big] \\
&= \frac{1}{\sin\omega}\Big[\sin\omega\cos\omega\cos(\alpha\omega) + (1-\cos^2\omega)\sin(\alpha\omega)\Big] \\
&= \frac{1}{\sin\omega}\Big[\sin\omega\cos\omega\cos(\alpha\omega) + \sin^2\omega\sin(\alpha\omega)\Big] \\
&= \cos\omega\cos(\alpha\omega) + \sin\omega\sin(\alpha\omega) \\
&= \cos(\omega - \alpha\omega) = \cos\!\bigl(\omega(1-\alpha)\bigr),
\end{aligned} \tag{38}$$

which proves ([34](https://arxiv.org/html/2605.11125#A3.E34)). We can now derive ([16](https://arxiv.org/html/2605.11125#S3.E16)).
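The identity (34) can also be confirmed numerically. The snippet below is an illustrative check with random unit vectors; the `slerp` helper implements the formula in (37), and the dimension and seed are arbitrary choices of ours.

```python
import numpy as np

def slerp(z0, z1, alpha):
    """Spherical linear interpolation between unit vectors z0 and z1, as in (37)."""
    omega = np.arccos(np.clip(np.dot(z0, z1), -1.0, 1.0))
    return (np.sin((1.0 - alpha) * omega) * z0 + np.sin(alpha * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
d = 32
z0 = rng.standard_normal(d)
z0 /= np.linalg.norm(z0)
e_k = rng.standard_normal(d)
e_k /= np.linalg.norm(e_k)

omega = np.arccos(np.dot(z0, e_k))          # angle between the endpoints
alphas = np.linspace(0.0, 1.0, 11)
points = np.stack([slerp(z0, e_k, a) for a in alphas])

lhs = points @ e_k                          # <z_alpha, e_k>
rhs = np.cos(omega * (1.0 - alphas))        # closed form (34)
```

The similarity to the target therefore depends only on the initial angle $\omega$ and the interpolation level $\alpha$, which is what the two approximations below exploit.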

#### Approximation 1

Since $\mathbf{z}_0$ and $\hat{\mathbf{e}}_k$ are independent random points on $\mathbb{S}^{d-1}$, ([33](https://arxiv.org/html/2605.11125#A3.E33)) implies that they are close to orthogonal with high probability in high dimension. We therefore approximate the initial angle by

$$\omega \approx \frac{\pi}{2}. \tag{39}$$

Substituting this into ([34](https://arxiv.org/html/2605.11125#A3.E34)) yields

$$\langle \mathbf{z}_\alpha, \hat{\mathbf{e}}_k \rangle \approx \cos\!\left(\frac{\pi}{2}(1-\alpha)\right) = \sin\!\left(\frac{\pi}{2}\alpha\right). \tag{40}$$

#### Approximation 2

Now define

$$M_\alpha := \max_{j \neq k} \langle \mathbf{z}_\alpha, \hat{\mathbf{e}}_j \rangle, \tag{41}$$

that is, the maximum similarity among codewords that the simplified generation dynamics does not interpolate toward. Using a union bound together with ([33](https://arxiv.org/html/2605.11125#A3.E33)), we obtain

$$\mathbb{P}\!\left(M_\alpha \geq t \mid \mathbf{z}_\alpha\right) \leq \sum_{j \neq k} \mathbb{P}\!\left(\langle \mathbf{z}_\alpha, \hat{\mathbf{e}}_j \rangle > t \mid \mathbf{z}_\alpha\right) \leq 2(|\mathcal{V}|-1)\exp\!\left(-\frac{dt^2}{2}\right). \tag{42}$$

We now solve for the threshold $t_{|\mathcal{V}|,d,\delta}$ such that, with probability at least $1-\delta$, we have $M_\alpha < t_{|\mathcal{V}|,d,\delta}$. Concretely, we solve

$$\delta = 2(|\mathcal{V}|-1)\exp\!\left(-\frac{d\,t_{|\mathcal{V}|,d,\delta}^2}{2}\right), \tag{43}$$

which gives

$$t_{|\mathcal{V}|,d,\delta} = \sqrt{\frac{2\log\!\bigl(2(|\mathcal{V}|-1)/\delta\bigr)}{d}}. \tag{44}$$

We can now conclude by defining $\alpha^\star(\delta)$ as the interpolation level after which $\mathbf{z}_\alpha$ is most similar to $\hat{\mathbf{e}}_k$ with probability at least $1-\delta$. A sufficient condition is

$$\begin{aligned}
\cos\!\bigl((1-\alpha^\star)\omega\bigr) &\approx \sin\!\left(\frac{\pi}{2}\alpha^\star\right) \geq t_{|\mathcal{V}|,d,\delta} \\
&\Longleftrightarrow \sin\!\left(\frac{\pi}{2}\alpha^\star\right) \geq \sqrt{\frac{2\log\!\bigl(2(|\mathcal{V}|-1)/\delta\bigr)}{d}} \\
&\Longleftrightarrow \alpha^\star(\delta) \approx \frac{2}{\pi}\arcsin\!\left(\sqrt{\frac{2\log\!\bigl(2(|\mathcal{V}|-1)/\delta\bigr)}{d}}\right).
\end{aligned} \tag{45}$$

This concludes the argument.
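The closed form in (45) is straightforward to evaluate. The helper below is a minimal sketch using only the standard library; the name `alpha_star` and the clipping of the arcsin argument at 1 (needed when $d$ is small relative to $\log|\mathcal{V}|$) are our own assumptions.

```python
import math

def alpha_star(vocab_size, d, delta):
    """alpha*(delta) ~= (2/pi) * arcsin( sqrt( 2 * log(2(|V|-1)/delta) / d ) )."""
    t = math.sqrt(2.0 * math.log(2.0 * (vocab_size - 1) / delta) / d)
    # ASSUMPTION: clip at 1 so that arcsin stays defined for small d.
    return (2.0 / math.pi) * math.asin(min(t, 1.0))

a_01 = alpha_star(50_000, 768, 0.1)       # delta = 0.1
a_001 = alpha_star(50_000, 768, 0.01)     # delta = 0.01
a_wide = alpha_star(50_000, 1024, 0.1)    # larger embedding dimension
```

For $d = 768$ and $|\mathcal{V}| = 50{,}000$, the formula as written here evaluates to roughly $0.12$ at $\delta = 0.1$ and $0.13$ at $\delta = 0.01$. Their complements, roughly $0.88$ and $0.87$, are close to the values $0.879$ and $0.869$ quoted in Sec. C.6; we read this as a hint, not a confirmed fact, that ([16](https://arxiv.org/html/2605.11125#S3.E16)) states the threshold in the complementary convention.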

#### Numerical evaluation of the critical threshold

Table [4](https://arxiv.org/html/2605.11125#A3.T4) shows the critical threshold for various vocabulary sizes $|\mathcal{V}|$, embedding dimensions $d$, and confidence parameters $\delta$. As expected, the threshold *decreases* as the embedding dimension increases, which means that the minimum noise level used during training should increase to counteract the increasing sparsity of $\mathbb{S}^{d-1}$. Furthermore, reducing $\delta$ corresponds to increasing the confidence level for the similarity to the clean token being the largest, hence the minimal noise level is *smaller* than with a larger value of $\delta$.

Table 4: $\alpha^\star(\delta)$ for different $|\mathcal{V}|$, $d$, and $\delta$. Panels: $|\mathcal{V}| = 12$, $|\mathcal{V}| = 50{,}000$, $|\mathcal{V}| = 100{,}000$.

### C.3 Sudoku Setup

We format the input as a 1D sequence, starting with the partial grid where the missing digits are encoded with `0`. We append a \[BOS\] separator and the solution, and separate rows with \[SEP\]. This representation allows us to train both AR and diffusion models with the same format with a context length of 180. Our vocabulary contains 12 tokens, except for MDLM, which additionally uses a \[mask\] token. When training the diffusion models, the partial grid is never corrupted, and we never penalize the predictions on those positions.

We train all models with Adam for 20k steps, batch size 256, a learning rate of $3 \times 10^{-4}$, and an *Exponential Moving Average* (EMA) rate of 0.9999. For $\mathbb{S}$-FLM, we consider the linear and $\cos^2$ schedules and truncate following ([16](https://arxiv.org/html/2605.11125#S3.E16)) (value in Table [4](https://arxiv.org/html/2605.11125#A3.T4)).

### C.4 Additional Ablations on Sudoku

Table [6](https://arxiv.org/html/2605.11125#A3.T6) ablates two design choices on Sudoku: the noise schedule and the embedding renormalization step. We define the noise schedules in Table [5](https://arxiv.org/html/2605.11125#A3.T5).

Table 5: Analytical noise schedules used on Sudoku, in the convention $\alpha_0 = 0$, $\alpha_1 = 1$.

#### Effect of the noise schedule

Truncating to $\alpha^\star(0.1)$ improves accuracy on both Sudoku (Table [1](https://arxiv.org/html/2605.11125#S4.T1)) and TinyGSM (Table [2](https://arxiv.org/html/2605.11125#S4.T2)). On Sudoku, the gains are largest for the linear schedule (Table [6](https://arxiv.org/html/2605.11125#A3.T6)). Cosine$^2$ also benefits from truncation, but by a smaller margin. Without truncation, Cosine$^2$ outperforms the linear schedule.

#### Effect of embedding renormalization

Training without renormalizing the embeddings after each step improves accuracy across nearly all schedules and difficulties (Table [6](https://arxiv.org/html/2605.11125#A3.T6)). The gap is largest at the hard difficulty, reaching 7.3 points on Cosine$^2$ without truncation.

Table 6: Effect of embedding renormalization on $\mathbb{S}$-FLM. Each row compares the same config without vs. with embedding re-normalization after every optimizer update. We bold the best result per row and difficulty. Training without re-normalization after each step leads to stronger performance.

As discussed in Suppl. [B.5](https://arxiv.org/html/2605.11125#A2.SS5), the gradient with respect to an embedding that is normalized in the forward pass is tangent to the sphere, so each gradient step increases the norm of the embeddings. This is equivalent to a learning-rate decay on the embedding table. Empirically, such annealing improves performance on Sudoku (Table [6](https://arxiv.org/html/2605.11125#A3.T6)).

### C.5 Bootstrap Confidence Intervals on TinyGSM

The error bars on the TinyGSM accuracy plots are $95\%$ percentile bootstrap confidence intervals \[[24](https://arxiv.org/html/2605.11125#bib.bib24),[72](https://arxiv.org/html/2605.11125#bib.bib72)\]. The GSM8K test split contains $N = 1{,}319$ problems, each with a generated solution that is either correct or incorrect. We resample the $N$ outcomes with replacement $B = 1{,}000$ times and recompute the accuracy on every resample. The point estimate is the mean of the $B$ accuracies. The error bar runs from the $2.5$th to the $97.5$th percentile of the empirical bootstrap accuracy distribution.
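The procedure above can be sketched in a few lines of NumPy. This is an illustrative implementation under our own naming (`bootstrap_ci`), and the outcome vector is synthetic: the count of 240 solved problems is a made-up example, not a result from the paper.

```python
import numpy as np

def bootstrap_ci(outcomes, B=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of binary outcomes."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    N = outcomes.shape[0]
    # Resample the N outcomes with replacement, B times.
    idx = rng.integers(0, N, size=(B, N))
    accs = outcomes[idx].mean(axis=1)            # accuracy of each resample
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(accs, [tail, 1.0 - tail])
    return accs.mean(), lo, hi                   # point estimate and error bar

# Hypothetical outcomes: 1,319 problems, 240 marked correct (illustrative only).
outcomes = np.zeros(1319)
outcomes[:240] = 1.0
point, lo, hi = bootstrap_ci(outcomes)
```

With $N = 1{,}319$ binary outcomes the interval is a few percentage points wide, which gives a sense of how large an accuracy gap must be before it is distinguishable from resampling noise.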

### C.6 Additional Ablations on TinyGSM

Table [7](https://arxiv.org/html/2605.11125#A3.T7) sweeps the truncation hyperparameter $\alpha_{\text{trunc}}$ on the linear schedule, with the standard DiT backbone trained for 250k steps. Truncating the noise schedule improves the accuracy from $1.21\%$ at $\alpha_{\text{trunc}} = 0$ to $7.73\%$ at $\alpha_{\text{trunc}} = 0.9$. The thresholds from ([16](https://arxiv.org/html/2605.11125#S3.E16)) for $d = 768$ and $|\mathcal{V}| \approx 50\text{k}$, $\alpha^\star(0.1) = 0.879$ and $\alpha^\star(0.01) = 0.869$, perform similarly to the best run ($\alpha_{\text{trunc}} = 0.9$).

Table 7: Effect of the truncation hyperparameter $\alpha_{\text{trunc}}$ on $\mathbb{S}$-FLM accuracy on GSM8K (trained on TinyGSM). Linear noise schedule, standard DiT backbone, 250k steps. The values $0.879$ and $0.869$ are the thresholds $\alpha^\star(\delta)$ from ([16](https://arxiv.org/html/2605.11125#S3.E16)) at $\delta = 0.1$ and $\delta = 0.01$ respectively. Bold marks the best result.

### C.7 Additional Sampling Results on TinyGSM

For models trained on TinyGSM, we sweep architecture, sampling temperature, and velocity decoding, and report GSM8K accuracy. Figure [6](https://arxiv.org/html/2605.11125#A3.F6) replots Figure [1](https://arxiv.org/html/2605.11125#S0.F1) (right). At $T = 0.1$, Duo reaches $36.0\%$ and MDLM about $33\%$, while $\mathbb{S}$-FLM plateaus near $18\%$. Figure [7](https://arxiv.org/html/2605.11125#A3.F7) compares the $\mathbb{S}$-arch and standard DiT backbones at both temperatures, with $T = 0.1$ improving GSM8K accuracy by about six points for both. Figure [8](https://arxiv.org/html/2605.11125#A3.F8) sweeps the top-$k$ truncation of velocity decoding for $\mathbb{S}$-arch at $T = 1$. Top-$1$ reaches $18.0\%$, while $k \geq 10$ plateaus near $12\%$, matching unrestricted decoding.

![Refer to caption](https://arxiv.org/html/2605.11125v1/x9.png)

Figure 6: GSM8K accuracy vs. NFE at $T=1$ for $\mathbb{S}$-FLM ($\mathbb{S}$-arch and standard DiT), MDLM, and Duo. Same data as Figure [1](https://arxiv.org/html/2605.11125#S0.F1) (right).

![Refer to caption](https://arxiv.org/html/2605.11125v1/x10.png)

Figure 7: GSM8K accuracy vs. NFE under exact decoding for $\mathbb{S}$-FLM with $\mathbb{S}$-arch (Norm.) and standard DiT (Std.), at $T=1$ (grey) and $T=0.1$ (orange). Lowering the temperature improves accuracy by about six points for both backbones.

![Refer to caption](https://arxiv.org/html/2605.11125v1/x11.png)

Figure 8: GSM8K accuracy vs. NFE for $\mathbb{S}$-FLM ($\mathbb{S}$-arch) at $T=1$, sweeping the top-$k$ truncation of the predicted velocity field at each Euler step. Top-$1$ reaches $18.0\%$, while all values $k\geq 10$ plateau near $12\%$, matching unrestricted decoding.

![Refer to caption](https://arxiv.org/html/2605.11125v1/x12.png)
![Refer to caption](https://arxiv.org/html/2605.11125v1/x13.png)

Figure 9: GSM8K accuracy vs. NFE for $\mathbb{S}$-FLM ($\mathbb{S}$-arch) under exact velocity, stochastic, top-$1$, and top-$10$ decoding. (left) Sampling temperature $T=1$. Exact velocity and stochastic decoding plateau near $12\%$, while top-$1$ reaches $18.0\%$. (right) Sampling temperature $T=0.1$. All four schemes plateau within one point of $18.0\%$. Top-$1$ decoding ($T=1$) outperforms low-temperature stochastic decoding.

### C.8 Gen. PPL / Entropy Frontier Protocol

We adopt the same setup as PGM \[[20](https://arxiv.org/html/2605.11125#bib.bib20)\]. For each method (MDLM, Duo, FLM, and $\mathbb{S}$-FLM with exact, stochastic, and $k=1$ velocity decoding), we sweep the sampling temperature $T\in\{0.50,0.55,\ldots,1.20\}$ ($15$ evenly spaced values) and the number of function evaluations $\mathrm{NFE}\in\{32,64,128,256,512,1024\}$. For every $(T,\mathrm{NFE})$ cell, we draw $N=512$ samples of length $L=1024$ from the OpenWebText-trained model and compute two scalars per sample.
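
As a quick illustration of the size of this sweep, the grid of cells can be enumerated as follows. This is only a sketch of the bookkeeping; the variable names are illustrative, and sampling from the models is not shown.

```python
from itertools import product

# Temperatures 0.50, 0.55, ..., 1.20, rounded to avoid floating-point drift.
temperatures = [round(0.50 + 0.05 * i, 2) for i in range(15)]
nfes = [32, 64, 128, 256, 512, 1024]

# One cell per (T, NFE) pair; each cell holds N = 512 samples of length L = 1024.
cells = list(product(temperatures, nfes))

print(len(temperatures), temperatures[0], temperatures[-1])
print(len(cells))
```

Each method therefore contributes at most $15 \times 6 = 90$ cells before any filtering.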

#### Generative Perplexity

We score each generated sample with a pretrained GPT-2-large reference model \[[20](https://arxiv.org/html/2605.11125#bib.bib20)\] and report the per-sample average

$$\mathrm{PPL}_{\mathrm{gen}}=\exp\!\left(-\frac{1}{L}\sum_{i=1}^{L}\log p_{\mathrm{GPT2}}\!\left(x_{i}\mid x_{<i}\right)\right), \tag{46}$$

where tokens at or after the first end-of-text token are masked. We then average across the $N$ samples in the cell.
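
A minimal sketch of this computation, assuming the per-token log-probabilities under the reference model are already available, might look as follows. The function names and the `eos_id` argument are illustrative. Normalizing by the number of unmasked tokens, rather than by the full length $L$, is an assumption, since the text does not state how masking interacts with the $1/L$ factor.

```python
import math

def generative_perplexity(token_logprobs, token_ids, eos_id):
    """Perplexity of one sample from per-token log-probs of a reference model.

    Tokens at or after the first eos_id are masked out. Dividing by the
    number of kept tokens (not the full length) is an assumption.
    """
    cutoff = token_ids.index(eos_id) if eos_id in token_ids else len(token_ids)
    kept = token_logprobs[:cutoff]
    if not kept:
        raise ValueError("sample has no tokens before end-of-text")
    return math.exp(-sum(kept) / len(kept))

def cell_perplexity(samples, eos_id):
    """Average the per-sample perplexities over one (T, NFE) cell.

    `samples` is a list of (token_logprobs, token_ids) pairs.
    """
    ppls = [generative_perplexity(lp, ids, eos_id) for lp, ids in samples]
    return sum(ppls) / len(ppls)
```

For instance, a sample whose unmasked tokens each receive probability $0.5$ from the reference model has a generative perplexity of $2$.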

#### Unigram entropy

To flag degenerate (repetitive) outputs, we compute the per-cell mean unigram entropy \[[20](https://arxiv.org/html/2605.11125#bib.bib20)\]

$$H=-\frac{1}{N}\sum_{n=1}^{N}\sum_{v\in\mathcal{V}}\frac{c(v,x^{(n)})}{L}\log\frac{c(v,x^{(n)})}{L}, \tag{47}$$

where $c(v,x^{(n)})$ counts the occurrences of token $v$ in sample $x^{(n)}$.
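
The entropy above can be sketched in a few lines, with tokens that never occur contributing zero to the sum. The function names are illustrative, and samples are assumed to be plain lists of token ids.

```python
import math
from collections import Counter

def unigram_entropy(token_ids):
    """Entropy (in nats) of the empirical token distribution of one sample."""
    length = len(token_ids)
    counts = Counter(token_ids)
    return -sum((c / length) * math.log(c / length) for c in counts.values())

def cell_entropy(samples):
    """Mean unigram entropy over the samples of one (T, NFE) cell."""
    return sum(unigram_entropy(s) for s in samples) / len(samples)
```

A sample that repeats a single token has entropy $0$, while a sample of length $L=1024$ with no repeated token reaches the maximum $\ln 1024\approx 6.93$ nats.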

#### Visible window

Each cell contributes one $(H,\mathrm{PPL}_{\mathrm{gen}})$ point. We restrict the visible window to $4.5\leq H\leq 6.0$ nats and $\mathrm{PPL}_{\mathrm{gen}}\leq 500$ to drop collapsed and degenerate configurations. $\mathbb{S}$-FLM with $k=1$ velocity does not depend on the temperature, so it appears as a single marker rather than a curve.
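
The window restriction amounts to a simple filter over the per-cell points. The sketch below uses the bounds stated above; the function name and the example points are hypothetical and are not results from the paper.

```python
def in_visible_window(entropy, gen_ppl, h_min=4.5, h_max=6.0, ppl_max=500.0):
    """True if a cell's (H, Gen. PPL) point lies inside the plotted window."""
    return h_min <= entropy <= h_max and gen_ppl <= ppl_max

# Hypothetical (entropy, gen_ppl) points, made up for illustration only.
points = [(5.2, 80.0), (3.9, 12.0), (5.8, 900.0)]
visible = [p for p in points if in_visible_window(*p)]
print(visible)
```

Here the second point would be dropped for low entropy and the third for exceeding the perplexity cap.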

### C.9 Additional Results on OpenWebText

Figure [10](https://arxiv.org/html/2605.11125#A3.F10) shows the Gen. PPL / entropy frontier for $\mathrm{NFE}\in\{32,64,128,256,512,1024\}$.

![Refer to caption](https://arxiv.org/html/2605.11125v1/x14.png)
![Refer to caption](https://arxiv.org/html/2605.11125v1/x15.png)
![Refer to caption](https://arxiv.org/html/2605.11125v1/x16.png)
![Refer to caption](https://arxiv.org/html/2605.11125v1/x17.png)
![Refer to caption](https://arxiv.org/html/2605.11125v1/x18.png)
![Refer to caption](https://arxiv.org/html/2605.11125v1/x19.png)

Figure 10: OpenWebText Gen. PPL versus per-sample unigram entropy for $\mathrm{NFE}\in\{32,64,128,256,512,1024\}$. Each curve sweeps a sampling-temperature schedule. $\mathbb{S}$-FLM (exact velocity and stochastic decoding) traces the Pareto frontier across every NFE budget, with the largest margin over MDLM, Duo, and FLM at low NFE.

## Appendix D Extended Related Work

$\mathbb{S}$-FLM differs from prior work in three main ways: (1) it operates in a continuous rather than a discrete space, (2) it defines the flow on the hypersphere rather than in Euclidean space, and (3) it learns embeddings end-to-end instead of relying on pre-trained representations.

#### Discrete diffusion language models

Discrete diffusion models for text \[[5](https://arxiv.org/html/2605.11125#bib.bib5),[9](https://arxiv.org/html/2605.11125#bib.bib9),[52](https://arxiv.org/html/2605.11125#bib.bib52),[76](https://arxiv.org/html/2605.11125#bib.bib76),[28](https://arxiv.org/html/2605.11125#bib.bib28),[10](https://arxiv.org/html/2605.11125#bib.bib10),[77](https://arxiv.org/html/2605.11125#bib.bib77),[19](https://arxiv.org/html/2605.11125#bib.bib19),[82](https://arxiv.org/html/2605.11125#bib.bib82),[61](https://arxiv.org/html/2605.11125#bib.bib61),[79](https://arxiv.org/html/2605.11125#bib.bib79),[92](https://arxiv.org/html/2605.11125#bib.bib92),[20](https://arxiv.org/html/2605.11125#bib.bib20),[93](https://arxiv.org/html/2605.11125#bib.bib93),[78](https://arxiv.org/html/2605.11125#bib.bib78),[4](https://arxiv.org/html/2605.11125#bib.bib4)\] approach AR in Gen. PPL while enabling parallel generation, but require updating tokens independently at inference for tractability. Furthermore, because the sampling steps are not differentiable, gradient-based guidance \[[21](https://arxiv.org/html/2605.11125#bib.bib21),[36](https://arxiv.org/html/2605.11125#bib.bib36),[62](https://arxiv.org/html/2605.11125#bib.bib62),[80](https://arxiv.org/html/2605.11125#bib.bib80)\] is harder than in the continuous case.

#### Continuous diffusion for language modeling

Instead of operating in the discrete space of tokens, some prior work uses Gaussian diffusion \[[84](https://arxiv.org/html/2605.11125#bib.bib84),[35](https://arxiv.org/html/2605.11125#bib.bib35)\] for language modeling. One approach is to apply Gaussian diffusion to embeddings and train end-to-end with CE \[[46](https://arxiv.org/html/2605.11125#bib.bib46),[22](https://arxiv.org/html/2605.11125#bib.bib22),[33](https://arxiv.org/html/2605.11125#bib.bib33)\]. Alternatively, others regress onto pre-trained embeddings \[[86](https://arxiv.org/html/2605.11125#bib.bib86),[54](https://arxiv.org/html/2605.11125#bib.bib54),[81](https://arxiv.org/html/2605.11125#bib.bib81)\]. The latter requires two training stages, and the sample quality is capped by the pre-trained embeddings. Instead of embeddings, recent work adds Gaussian noise to one-hot or simplex representations of the tokens \[[13](https://arxiv.org/html/2605.11125#bib.bib13),[34](https://arxiv.org/html/2605.11125#bib.bib34),[55](https://arxiv.org/html/2605.11125#bib.bib55),[85](https://arxiv.org/html/2605.11125#bib.bib85),[45](https://arxiv.org/html/2605.11125#bib.bib45),[75](https://arxiv.org/html/2605.11125#bib.bib75),[68](https://arxiv.org/html/2605.11125#bib.bib68)\]. This approach requires storing dense arrays of size $L\times|\mathcal{V}|$ during training and sampling. Separately, Loopholing discrete diffusion \[[38](https://arxiv.org/html/2605.11125#bib.bib38)\] allows discrete diffusion models to propagate the latent representation of the denoiser across sampling steps. In contrast, $\mathbb{S}$-FLM is defined on the latent hypersphere, so it does not require materializing dense arrays of size $L\times|\mathcal{V}|$ during training. We inject noise by rotating embeddings rather than by adding Gaussian noise.
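
To give a rough sense of scale for the dense-array argument, the following back-of-the-envelope count compares entries per sequence. It borrows $L=1024$ from Appendix C.8 and $d=768$, $|\mathcal{V}|\approx 50\text{k}$ from Appendix C.6; combining these settings and rounding the vocabulary to 50,000 are assumptions made only for illustration.

```python
def dense_entries(seq_len, width):
    """Number of scalar entries in a dense seq_len x width array."""
    return seq_len * width

seq_len = 1024        # sample length quoted in Appendix C.8
vocab_size = 50_000   # rounded; the text quotes |V| of roughly 50k
latent_dim = 768      # d quoted in Appendix C.6

one_hot_entries = dense_entries(seq_len, vocab_size)
latent_entries = dense_entries(seq_len, latent_dim)
print(one_hot_entries, latent_entries, round(one_hot_entries / latent_entries, 1))
```

Under these assumed sizes, the one-hot representation holds about 65 times more entries per sequence than the latent one. The count ignores batch size and data type.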

#### Representation learning on the hypersphere

Hyperspherical representations are common in contrastive learning, where uniform spread in $\mathbb{S}^{d-1}$ correlates with strong downstream performance \[[94](https://arxiv.org/html/2605.11125#bib.bib94)\]. In high dimensions, comparing word embeddings with cosine similarity performs better than with the Euclidean distance \[[58](https://arxiv.org/html/2605.11125#bib.bib58),[67](https://arxiv.org/html/2605.11125#bib.bib67)\]. Accordingly, cosine similarity underpins a large part of neural retrieval systems \[[74](https://arxiv.org/html/2605.11125#bib.bib74),[40](https://arxiv.org/html/2605.11125#bib.bib40)\]. Training Variational Autoencoders with a latent prior in $\mathbb{S}^{d-1}$ can be more stable than with Gaussian priors \[[16](https://arxiv.org/html/2605.11125#bib.bib16),[96](https://arxiv.org/html/2605.11125#bib.bib96)\]. Normalizing activations and weights to $\mathbb{S}^{d-1}$ can also improve the stability of AR models \[[50](https://arxiv.org/html/2605.11125#bib.bib50)\]. On Riemannian manifolds, predicting the clean endpoint can outperform regressing the velocity field \[[25](https://arxiv.org/html/2605.11125#bib.bib25),[97](https://arxiv.org/html/2605.11125#bib.bib97)\]. Previous work extended CNFs and score-based diffusion to Riemannian manifolds \[[51](https://arxiv.org/html/2605.11125#bib.bib51),[56](https://arxiv.org/html/2605.11125#bib.bib56),[8](https://arxiv.org/html/2605.11125#bib.bib8),[53](https://arxiv.org/html/2605.11125#bib.bib53)\]. However, these methods typically assume that the data already reside on the manifold. In contrast, we learn the token representation and the velocity on $\mathbb{S}^{d-1}$ jointly. Fisher-Flow \[[17](https://arxiv.org/html/2605.11125#bib.bib17)\] maps one-hot vectors onto the positive orthant of the hypersphere, but this representation does not perform well on language modeling with large vocabularies \[[37](https://arxiv.org/html/2605.11125#bib.bib37)\].