Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

arXiv cs.CL 05/15/26, 04:00 AM Papers
diffusion-models language-generation optimal-control flow-matching llm parallel-sampling text-generation
Summary
This paper reformulates language generation as a stochastic optimal control problem, addressing limitations of autoregressive and diffusion models, and proposes a closed-loop diffusion method in latent control space using Flow Matching, achieving high-fidelity generation and efficient parallel sampling.
arXiv:2605.14531v1 Announce Type: new Abstract: This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.
Cached at: 05/15/26, 06:23 AM
# Closed-Loop Diffusion in Latent Control Space
Source: [https://arxiv.org/html/2605.14531](https://arxiv.org/html/2605.14531)
## Language Generation as Optimal Control: Closed\-Loop Diffusion in Latent Control Space

###### Abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

Machine Learning, ICML

## 1 Introduction

The paradigm of Large Language Models (LLMs) has been dominated by Autoregressive Models (ARMs), which generate text by sequential "next-token" prediction (Brown et al., [2020](https://arxiv.org/html/2605.14531#bib.bib4); Touvron et al., [2023](https://arxiv.org/html/2605.14531#bib.bib43)) and demonstrate remarkable scaling properties and emergent reasoning capabilities. Emerging Diffusion Language Models (DLMs) break this serial constraint and are a promising competitor to ARMs, offering a global receptive field and parallelizable sampling. Discrete DLMs, *e.g.*, D3PM (Austin et al., [2021](https://arxiv.org/html/2605.14531#bib.bib1)), MDLM (Sahoo et al., [2024](https://arxiv.org/html/2605.14531#bib.bib37)), and LLaDA (Nie et al., [2025a](https://arxiv.org/html/2605.14531#bib.bib32)), define a forward corruption process via transition matrices, *e.g.*, masking, within a discrete token space, and subsequently train a model to predict the original tokens or the denoising trajectory. Continuous DLMs, *e.g.*, CDCD (Dieleman et al., [2022](https://arxiv.org/html/2605.14531#bib.bib12)) and RDLM (Jo & Hwang, [2025](https://arxiv.org/html/2605.14531#bib.bib23)), project discrete tokens into a continuous embedding space to apply standard Gaussian diffusion frameworks, training the model to denoise high-dimensional vectors that are ultimately mapped back to the discrete vocabulary.

Despite great progress, several crucial issues challenge these generative language paradigms and define the current technological ceiling of LLMs. *1) Efficiency-Fidelity Paradox:* ARMs are architecturally tethered to serial decoding, creating a linear computational bottleneck ($O(N)$). While DLMs theoretically allow for parallel refinement, their reliance on heuristic Gaussian denoising necessitates hundreds of resampling steps to converge, merely trading one efficiency bottleneck for another. *2) Irreversibility Error Propagation:* ARMs suffer from accumulated errors during generation. In the open-loop setting of ARMs, a single “hallucinated” token in a sequence becomes the fixed prior for all subsequent steps, creating an accumulative and irreversible drift from the optimal trajectory. Standard DLMs struggle similarly: without a global guidance mechanism to regulate the denoising flow, the latent state often drifts into low-density semantic regions. *3) Optimization Tractability and Fidelity:* Recent attempts to apply continuous diffusion directly in the token embedding space (Strudel et al., [2022](https://arxiv.org/html/2605.14531#bib.bib40); Dieleman et al., [2022](https://arxiv.org/html/2605.14531#bib.bib12)) face a geometric hurdle. The raw embedding space is sparse and clustered rather than a smooth manifold. This “ill-conditioned topology” results in high-curvature generation trajectories and severe quantization errors, where the continuous denoising process struggles to map back to discrete linguistic tokens without losing structural integrity.

To address these issues, we first reformulate generative language models and revisit them from a stochastic optimal control perspective. Then, we propose **Manta-LM**, which connects text generation with continuous dynamical systems. Our approach rests on two theoretical pillars: **Manifold Rectification** and **Optimal Closed-Loop Control**. First, we employ a regularized Variational Autoencoder (VAE) to map the ill-conditioned discrete space to a compact, locally Euclidean latent manifold. This rectification reduces topological stiffness and encourages smoother transport trajectories. Second, within this continuous latent space, we model generation as an *Optimal Control* problem (Benamou & Brenier, [2000](https://arxiv.org/html/2605.14531#bib.bib2)). By approximating the Hamilton-Jacobi-Bellman (HJB) equation via *Flow Matching* (Lipman et al., [2023](https://arxiv.org/html/2605.14531#bib.bib27); Bertucci, [2023](https://arxiv.org/html/2605.14531#bib.bib3)), our model learns a vector field that acts as a **Closed-Loop Feedback Controller**.

## 2 Related Works

**Autoregressive Language Models.** Autoregressive (AR) models have long been the dominant paradigm in language modeling, forming the backbone of modern large language models such as GPT (Brown et al., [2020](https://arxiv.org/html/2605.14531#bib.bib4)) and LLaMA (Touvron et al., [2023](https://arxiv.org/html/2605.14531#bib.bib43)). By factorizing the joint distribution of a sequence into a product of conditional probabilities, AR models excel at modeling local syntactic dependencies and achieve strong likelihood-based performance. However, the AR paradigm fundamentally enforces a strictly sequential generation process, where each token decision is irrevocable once sampled. This token-level hard commitment renders inference inherently non-parallelizable and amplifies error accumulation through exposure bias. Recent analyses have increasingly recognized these limitations, motivating alternatives that relax strict left-to-right decoding in favor of global or iterative refinement strategies. Our work departs from AR generation by reinterpreting this paradigm through the lens of control theory, identifying AR decoding as a form of greedy open-loop control operating in a discrete state space, which lacks mechanisms for global trajectory optimization or feedback correction.

**Discrete Diffusion Language Models.** Discrete DLMs extend diffusion-based generative modeling to categorical data by defining Markovian noising and denoising processes over discrete token spaces (Sohl-Dickstein et al., [2015](https://arxiv.org/html/2605.14531#bib.bib39); Hoogeboom et al., [2021](https://arxiv.org/html/2605.14531#bib.bib20); Austin et al., [2021](https://arxiv.org/html/2605.14531#bib.bib1)). Among these, D3PM (Austin et al., [2021](https://arxiv.org/html/2605.14531#bib.bib1)) establishes a general framework using arbitrary transition matrices, while subsequent works explore masked diffusion as a particularly effective instantiation for language (Sun et al., [2022](https://arxiv.org/html/2605.14531#bib.bib41); Lou et al., [2023](https://arxiv.org/html/2605.14531#bib.bib28); Shi et al., [2024](https://arxiv.org/html/2605.14531#bib.bib38); Sahoo et al., [2024](https://arxiv.org/html/2605.14531#bib.bib37); Ou et al., [2024](https://arxiv.org/html/2605.14531#bib.bib34)). Recent advances demonstrate that discrete DLMs can achieve competitive perplexity with autoregressive models at GPT-2 scale, especially when incorporating absorbing states, score entropy objectives (SEDD), or refined masking schedules (Lou et al., [2023](https://arxiv.org/html/2605.14531#bib.bib28); Ou et al., [2024](https://arxiv.org/html/2605.14531#bib.bib34); Nie et al., [2025b](https://arxiv.org/html/2605.14531#bib.bib33)). Large-scale efforts further scale masked diffusion to billions of parameters and extend it to multimodal generation (Gong et al., [2024](https://arxiv.org/html/2605.14531#bib.bib17); Ye et al., [2025](https://arxiv.org/html/2605.14531#bib.bib46); Swerdlow et al., [2025](https://arxiv.org/html/2605.14531#bib.bib42); Yang et al., [2025](https://arxiv.org/html/2605.14531#bib.bib45); Li et al., [2025b](https://arxiv.org/html/2605.14531#bib.bib25)).
Despite these successes, discrete diffusion models inherit fundamental limitations from the non\-metric nature of token spaces\. The absence of a differentiable geometry prevents the definition of meaningful gradients over token trajectories, complicating the application of score matching and optimal transport principles\. Thus, many methods rely on heuristic masking, remasking, or conditional independence assumptions, leading to trade\-offs between generation quality, stability, and efficiency\. Those issues account for the inability of discrete models to perceive or optimize smooth generation trajectories\.

**Continuous Diffusion Language Models.** To recover differentiability, several works embed discrete tokens into continuous spaces and apply diffusion processes therein. Early approaches diffuse word embeddings directly and discretize outputs via nearest-neighbor or thresholding operations (Li et al., [2022](https://arxiv.org/html/2605.14531#bib.bib26); Dieleman et al., [2022](https://arxiv.org/html/2605.14531#bib.bib12); Gong et al., [2023](https://arxiv.org/html/2605.14531#bib.bib16),[https://arxiv.org/html/2605.14531#bib.bib15](https://arxiv.org/html/2605.14531#bib.bib15)). While conceptually simple, such methods often suffer from information loss during dequantization and struggle to preserve categorical semantics. More structured continuous relaxations operate on probability simplices or logit spaces, leveraging Dirichlet priors, simplex geometry, or concrete distributions to impose statistical constraints on the diffusion process (Han et al., [2023](https://arxiv.org/html/2605.14531#bib.bib19); Mahabadi et al., [2024](https://arxiv.org/html/2605.14531#bib.bib29)). Flow-matching and score-based techniques further interpret the simplex as a statistical manifold, enabling continuous-time modeling (Cheng et al., [2024](https://arxiv.org/html/2605.14531#bib.bib8)). Nevertheless, these approaches generally underperform discrete diffusion in generation fidelity or incur substantial computational overhead, particularly at scale (Gulrajani & Hashimoto, [2023](https://arxiv.org/html/2605.14531#bib.bib18)).

## 3 Revisiting Generative Language Models from Stochastic Optimal Control

In this section, we first cast generative modeling as a **Stochastic Optimal Control** problem (Fleming & Rishel, [2012](https://arxiv.org/html/2605.14531#bib.bib13)), and then present a detailed theoretical analysis of existing generative language models (*i.e.*, ARMs and DLMs). With stochastic optimal control, we formalize the generation process as the time evolution of a probability density under a vector field and provide theoretical explanations for the limitations of existing methods.

### 3.1 Stochastic Optimal Control

Stochastic Optimal Control (SOC) is the branch of control theory that seeks a control law driving a system's evolution at minimum cost in the presence of random noise. In the context of language generation, we model the generative process as the controlled evolution of a state $\mathbf{z}_t$ on a manifold $\mathcal{M}$ over a finite time horizon $t \in [0,1]$. The dynamics are governed by the stochastic differential equation (Benamou & Brenier, [2000](https://arxiv.org/html/2605.14531#bib.bib2)):

$$d\mathbf{z}_t = \mathbf{u}(\mathbf{z}_t, t)\,dt + \sigma(t)\,d\mathbf{w}_t, \quad \mathbf{z}_0 \sim p_{\text{prior}}, \tag{1}$$

where $\mathbf{u}(\cdot)$ is the **control law** (vector field) to be learned, and $\sigma(t)$ modulates the exploration noise. The objective of the generator is to transport the prior $p_{\text{prior}}$ to the data distribution $p_{\text{data}}$ with minimal effort. Following the Benamou-Brenier formulation (Benamou & Brenier, [2000](https://arxiv.org/html/2605.14531#bib.bib2)), the optimal control policy $\mathbf{u}^*$ is obtained by minimizing the transport cost functional $J(\mathbf{u})$:

$$J(\mathbf{u}) = \underbrace{\mathbb{E}_{\mathbf{z}_1 \sim p_1}\left[-\log p_\theta(\mathbf{z}_1)\right]}_{\text{Terminal Cost (Data Fidelity)}} + \lambda \int_0^1 \underbrace{\mathbb{E}_{\mathbf{z}_t \sim p_t}\left[\frac{1}{2}\|\mathbf{u}_t(\mathbf{z}_t)\|^2\right]}_{\text{Running Cost (Kinetic Energy)}} dt. \tag{2}$$

According to the **dynamic programming principle**, the optimal control law $\mathbf{u}^*$ can be characterized by the gradient of the *optimal value function* $V(\mathbf{z},t) = \inf_{\mathbf{u}} \mathbb{E}\left[J(\mathbf{u})\right]$ (the minimum cost-to-go), which satisfies the **Hamilton-Jacobi-Bellman (HJB)** equation. The optimal controller is necessarily:

$$\mathbf{u}^*(\mathbf{z},t) = -\nabla_{\mathbf{z}} V(\mathbf{z},t). \tag{3}$$

This indicates two critical properties for an ideal control system: *1) Closed-Loop Feedback:* The control $\mathbf{u}^*$ depends on the **current global state** $\mathbf{z}_t$, using the potential landscape $\nabla_{\mathbf{z}} V$ to correct deviations. *2) Smooth Geodesic Flow:* Minimizing the kinetic energy term $\frac{1}{2}\|\mathbf{u}\|^2$ mandates that the optimal trajectory follows low-energy, geodesic-like paths in the Wasserstein space.

More broadly, this view makes text generation comparable through three control properties: whether future objectives can influence current updates, whether the model has corrective directions toward valid states, and whether the trajectory geometry is sufficiently smooth for stable integration\. These properties are especially relevant for boundary\-conditioned generation, where a model must satisfy global constraints rather than only extend a prefix\.
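As a minimal sketch of the closed-loop dynamics in Equations 1 and 3, the following simulates a scalar state with Euler-Maruyama integration. The quadratic potential `V(z) = 0.5 * (z - target)**2`, the constant noise level `sigma`, and the function names are illustrative assumptions, not quantities from the paper.

```python
import math
import random


def grad_v(z, target=1.0):
    """Gradient of the assumed toy potential V(z) = 0.5 * (z - target)**2."""
    return z - target


def simulate(z0, sigma=0.1, steps=200, seed=0):
    """Euler-Maruyama integration of dz = u(z, t) dt + sigma dw on t in [0, 1].

    The control is the feedback law u = -grad V (Equation 3), so it is
    recomputed from the current state at every step.
    """
    rng = random.Random(seed)
    dt = 1.0 / steps
    z = z0
    for _ in range(steps):
        u = -grad_v(z)                       # closed-loop: depends on current z
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        z += u * dt + noise
    return z


if __name__ == "__main__":
    print(simulate(-2.0, sigma=0.0))         # noise-free trajectory endpoint
    print(simulate(-2.0, sigma=0.1))         # one noisy trajectory endpoint
```

Because the control is a function of the current state, a perturbation injected by the noise term is pulled back toward the target on later steps, which is the feedback property the text attributes to the optimal controller.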

### 3.2 Generative Language Modeling via SOC

With the optimal controller defined in [Equation 3](https://arxiv.org/html/2605.14531#S3.E3), we next place representative language generation paradigms into the same state-update view.

##### I\. Autoregressive Models \(AR\): Impulsive Open\-Loop Control\.

Autoregressive generation operates in a limit of zero viscosity ($\sigma(t) \to 0$), where the state evolution is driven by discrete token updates rather than a continuous flow. Under SOC, the generative dynamics are modeled as a deterministic system driven by an **impulsive control law**:

$$d\mathbf{z}_t = \underbrace{\left[\sum_{k=1}^{N} \mathbf{f}_\theta(\mathbf{z}_{t_k}, \mathbf{h}_{t_k}) \cdot \delta(t - t_k)\right]}_{\text{Impulsive Control } \mathbf{u}_{\text{AR}}(\mathbf{z}_t, t)} dt + \underbrace{0}_{\text{No Diffusion}} \cdot d\mathbf{w}_t, \tag{4}$$

where $\delta(\cdot)$ is the Dirac delta function, $\mathbf{h}_{t_k}$ is the past hidden state, and $\mathbf{f}_\theta$ is the greedy update at step $k$. Under this continuous-time interpretation, AR sampling corresponds to a **singular trajectory**, a piecewise constant path with impulsive updates at transition points $t_k$. The control update $\mathbf{f}_\theta$ is derived from local likelihood maximization, lacking adjoint state feedback for global optimality. Thus, under the SOC abstraction, ARMs behave as a stiff, open-loop control system that is prone to error accumulation when global terminal constraints matter.

##### II\. Discrete DLMs: Gradient\-Free Rate Control\.

Discrete DLMs operate on a categorical lattice where the state evolution follows a Controlled Continuous-Time Markov Chain (CTMC). In our SOC formulation, this is governed by a **Jump SDE** driven by Poisson counter processes $N$:

$$d\mathbf{z}_t = \sum_{\mathbf{y} \in \mathcal{D},\, \mathbf{y} \neq \mathbf{z}_t} (\mathbf{y} - \mathbf{z}_t) \cdot dN_t\left(\lambda = [\mathbf{u}_t(\mathbf{z}_t)]_{\mathbf{z}_t \to \mathbf{y}}\right). \tag{5}$$

Here, the control law $\mathbf{u}$ does not define a velocity vector, but rather a **Transition Rate Tensor** that modulates the intensity $\lambda$ of jumping to a neighbor state $\mathbf{y}$. Crucially, the state difference $(\mathbf{y} - \mathbf{z}_t)$ represents a discrete hop on a Hamming graph, not a tangent vector in a Riemannian manifold. Thus, the smooth differential operator $\nabla_{\mathbf{z}}$ required by the continuous HJB formulation is not directly available. Discrete methods may still define useful graph-based or finite-difference surrogate scores, but these are not the same object as a smooth vector field on a continuous manifold. Under the SOC lens, the system therefore relies on stochastic transition search rather than geometric guidance from a differentiable descent direction.
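The rate-controlled jump dynamics of Equation 5 can be sketched for a single token position with a standard Gillespie-style simulation. The four-state vocabulary, the absorbing-mask rate table `unmask_rates`, and the time-homogeneous rates are assumptions chosen for illustration; they are not the rate parameterization of any specific discrete DLM.

```python
import random

MASK = 0  # hypothetical mask state; states 1..3 stand in for vocabulary tokens


def unmask_rates(z):
    """Assumed transition rates: the mask state jumps to a token, tokens are absorbing."""
    if z == MASK:
        return {1: 10.0, 2: 20.0, 3: 30.0}
    return {}


def simulate_ctmc(z0, rate_fn, horizon=1.0, seed=0):
    """Simulate a continuous-time Markov chain with state-dependent jump rates.

    Returns the list of (time, state) pairs visited before `horizon`.
    """
    rng = random.Random(seed)
    t, z = 0.0, z0
    path = [(t, z)]
    while True:
        rates = rate_fn(z)                   # the "control": intensities, not velocities
        total = sum(rates.values())
        if total <= 0.0:
            break                            # absorbing state, no further jumps
        t += rng.expovariate(total)          # exponential holding time
        if t >= horizon:
            break
        r = rng.random() * total             # pick the next state proportionally to its rate
        acc = 0.0
        for y, lam in rates.items():
            acc += lam
            z = y
            if r < acc:
                break
        path.append((t, z))
    return path


if __name__ == "__main__":
    print(simulate_ctmc(MASK, unmask_rates))
```

The controller here only reweights which hop is likely; there is no notion of moving a small distance in a chosen direction, which is the point the paragraph above makes about the missing gradient operator.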

##### III\. Continuous DLMs: Ill\-Conditioned Control\.

Existing continuous DLMs that apply diffusion directly to raw word embeddings restore the functional form of the optimal control SDE but can be limited due to topological constraints\. The dynamics follow the standard reverse\-time SDE:

$$d\mathbf{z}_t = \underbrace{\left[\mathbf{f}(\mathbf{z}, t) - g(t)^2 \nabla_{\mathbf{z}} \log p_t(\mathbf{z})\right]}_{\text{Feedback Control } \mathbf{u}(\mathbf{z}, t)} dt + g(t)\,d\mathbf{w}_t. \tag{6}$$

Although $\mathbf{u}$ takes the form of a closed-loop controller via the score function, the state space $\mathcal{Z}_{\text{emb}}$ is sparse, clustered, and non-manifold. This irregular geometry can induce large variations in the score function, resulting in a vector field with high local Lipschitz constants. In control theory, this characterizes a **stiff system**, which forces numerical solvers to take smaller steps to maintain stability and can reduce the efficiency benefits of continuous modeling.

### 3.3 Diagnosing Generative Dynamics in Autoregressive and Diffusion Flows

By contrasting the generative dynamics ([Equation 4](https://arxiv.org/html/2605.14531#S3.E4), [5](https://arxiv.org/html/2605.14531#S3.E5)) against the optimal control law $\mathbf{u}^* = -\nabla V$ ([Equation 3](https://arxiv.org/html/2605.14531#S3.E3)), we diagnose three issues under the SOC lens: **Trajectory Singularity**, **Adjoint State Vanishing**, and **Gradient Absence**. These issues help explain why current paradigms can face serial inefficiency, irreversible error propagation, and optimization intractability in globally constrained generation settings.

**I. Efficiency Paradox via Trajectory Singularity and System Stiffness.** Optimal control favors finite kinetic energy $\mathcal{A} = \frac{1}{2}\int_0^1 \|\mathbf{u}_t\|^2 dt < \infty$. However, substituting the control laws derived in [Section 3.2](https://arxiv.org/html/2605.14531#S3.SS2) reveals how several language-modeling paradigms depart from this smoothness condition under the continuous-control interpretation.

*i) Impulsive dynamics of AR:* Substituting the impulsive control law from [Equation 4](https://arxiv.org/html/2605.14531#S3.E4) yields a divergence:

$$\mathcal{A}_{\text{AR}} \propto \int_0^1 \left\|\sum \mathbf{f}_k \delta(t - t_k)\right\|^2 dt \to \infty. \tag{7}$$

The $L_2$ norm of the Dirac delta is not finite, so this continuous-time embedding corresponds to a highly stiff path with impulsive curvature. Such dynamics are poorly matched to parallel continuous solvers (e.g., Picard iteration), reflecting the serialism bottleneck of AR decoding.
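The divergence in Equation 7 can be checked numerically by replacing the Dirac delta with a unit-area rectangular pulse of width `eps` and letting `eps` shrink. The rectangular mollifier and the midpoint Riemann sum are assumptions made for this sketch; analytically the integral of the squared pulse is `1 / eps`.

```python
def pulse_energy(eps, grid=100_000):
    """Midpoint Riemann sum of the squared unit-area pulse of width eps on [0, 1].

    The pulse has height 1 / eps on |t - 0.5| <= eps / 2 and is zero elsewhere,
    so it approaches a Dirac delta at t = 0.5 as eps -> 0.
    """
    dt = 1.0 / grid
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) * dt
        p = 1.0 / eps if abs(t - 0.5) <= eps / 2 else 0.0
        total += p * p * dt
    return total


if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        print(eps, pulse_energy(eps))        # grows roughly like 1 / eps
```

Each tenfold narrowing of the pulse multiplies the energy by about ten, so the kinetic energy has no finite limit as the update becomes truly impulsive.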

*ii) Lipschitz Explosion of Continuous DLM:* For diffusion on raw embeddings, the sparse, non-manifold topology can induce large variations in the score function. The local Lipschitz constant $L$ explodes:

$$L = \sup_{\mathbf{z} \neq \mathbf{y}} \frac{\|\mathbf{u}(\mathbf{z}) - \mathbf{u}(\mathbf{y})\|}{\|\mathbf{z} - \mathbf{y}\|} \gg 1. \tag{8}$$

Since numerical stability requires step sizes $\Delta t < 2/L$, a large $L$ forces solvers to take smaller steps ($\Delta t \to 0$). This increases the Number of Function Evaluations (NFE), reducing the practical efficiency of continuous modeling and making optimization less tractable.
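The step-size bound $\Delta t < 2/L$ can be illustrated on the linear test equation $dz/dt = -Lz$, whose vector field has Lipschitz constant exactly $L$. The test equation and the helper name `euler_final` are assumptions for this sketch; explicit Euler multiplies the state by $(1 - L\,\Delta t)$ per step, so it contracts only when $|1 - L\,\Delta t| < 1$.

```python
def euler_final(L, dt, steps, z0=1.0):
    """Explicit Euler on dz/dt = -L * z; returns the state after `steps` steps."""
    z = z0
    for _ in range(steps):
        z += dt * (-L * z)                   # z <- (1 - L * dt) * z
    return z


if __name__ == "__main__":
    L = 100.0
    print(abs(euler_final(L, dt=0.019, steps=100)))   # dt < 2 / L: decays
    print(abs(euler_final(L, dt=0.021, steps=100)))   # dt > 2 / L: blows up
```

For a unit time horizon the bound implies more than `L / 2` function evaluations, so the solver cost scales with the stiffness of the field.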

**II. Irreversibility Error Propagation via Adjoint State Vanishing (Open-Loop).** Optimal policies depend on the **Prediction Horizon** $H$, which defines the scope of the objective functional $J_H(\mathbf{u}_t) = \mathbb{E}\left[\int_t^{t+H} \frac{1}{2}\|\mathbf{u}\|^2 d\tau + \Phi(\mathbf{z}_{t+H})\right]$. The global guidance is carried by the **Adjoint State** $\mathbf{p}_t$, which back-propagates the terminal potential $\Phi$ from $t+H$.

*i) Adjoint Vanishing:* Global optimality requires a full horizon $H = 1 - t$. In contrast, AR effectively operates like a model predictive control (MPC) policy with a *single-step horizon* ($H \to 0$ in the continuous limit). This truncation discards the integral cost beyond the immediate step, severing the backward link to $\Phi(\mathbf{z}_1)$ and forcing $\mathbf{p}_t \equiv \mathbf{0}$.

*ii) Lyapunov Instability:* Without the restoring force provided by the adjoint feedback $\mathbf{u}_{\text{fb}} \propto \mathbf{p}_t$, the system operates in an unstable open-loop regime. Small quantization errors $\epsilon$ amplify over time rather than being suppressed, leading to diverging error dynamics.

This provides a control-theoretic explanation for **Irreversible Error Propagation** (e.g., hallucinations) in AR, where early mistakes can steer subsequent generation away from the desired trajectory.
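A toy scalar model makes the open-loop versus closed-loop contrast concrete. The linear error recursion $e_{k+1} = (a - \kappa)\,e_k$, the growth factor `growth`, and the feedback gain `gain` are assumptions introduced only for illustration; the paper's analysis is stated in terms of adjoint states, not this recursion.

```python
def propagate_error(e0, growth, gain, steps):
    """Iterate e_{k+1} = (growth - gain) * e_k and return the final |error|.

    gain = 0 models an open-loop system with no corrective feedback;
    gain > 0 models a restoring force proportional to the current error.
    """
    e = e0
    for _ in range(steps):
        e = (growth - gain) * e
    return abs(e)


if __name__ == "__main__":
    open_loop = propagate_error(1e-3, growth=1.1, gain=0.0, steps=50)
    closed_loop = propagate_error(1e-3, growth=1.1, gain=0.6, steps=50)
    print(open_loop, closed_loop)
```

With an unstable growth factor the open-loop error is amplified by about two orders of magnitude over 50 steps, while a sufficiently large feedback gain makes the same initial error contract.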

**III. Geometric Blindness via Gradient Absence.** Efficient optimal control relies on a **Gradient Flow** $\mathbf{u}^* = -\nabla_z V$ to guide the state towards high-density regions. However, discrete diffusion models ([Equation 5](https://arxiv.org/html/2605.14531#S3.E5)) are constrained to a lattice $\mathcal{D}$ equipped with the Hamming metric. *i) Metric Singularity:* The discrete space $\mathcal{D}$ admits no tangent space $T_{\mathbf{z}}\mathcal{M}$. Consequently, the differential operator $\nabla_{\mathbf{z}}$ required by the HJB equation is ill-defined.

*ii) Fidelity Degradation:* Without the directional guidance of a smooth gradient field $-\nabla_z V$, the controller relies on stochastic transition search. Compared with continuous flows that can follow a differentiable descent direction toward the data manifold, this search can settle for sub-optimal local states, helping explain the *fidelity gap* and lack of fine-grained control in discrete generative models (e.g., AR and DLM).

This geometric limitation contributes to the **Efficiency-Fidelity Paradox**: improving sample quality often requires more stochastic refinement steps, which reduces the intended parallel speedup.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x1.png)

Figure 1: **Generation dynamics.** On a non-convex manifold, (a) AR and Diffusion are trapped in a slow, myopic crawl along the high-curvature density ridge. (b) In contrast, our method approximates the global optimal trajectory, bypassing curvature via the rectified latent geometry (energy-minimizing geodesic) for improved efficiency.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x2.png)

Figure 2: **Visualizing Generative Dynamics and Error Propagation on the BVP task.** Color from pink to blue denotes generation progress. **(a)** AR suffers from **compounding errors** (blue lines in (d)) due to open-loop myopia, drifting off-manifold. **(b)** Discrete DLM relies on stochastic combinatorial search, showing jagged trajectories caused by **geometric blindness** (lack of gradients). **(c)** Our Manta-LM acts as an optimal closed-loop controller, utilizing the learned vector field to self-correct deviations along a low-energy geodesic. **(d)** Quantitative analysis shows that Manta-LM achieves superior convergence stability and reduces terminal error.

## 4 Optimal Controller as Diffusion Language Model

The SOC analysis above suggests three requirements for an effective closed-loop language generator: a smooth control space in which continuous dynamics are tractable, locality-preserving controllability so partial constraints can be imposed and corrected locally, and global interaction so the controller can use full-sequence information when updating each state. To instantiate these requirements, we propose **Manta-LM**, a Latent Diffusion Language Model formulated as an optimal closed-loop controller. Manta-LM uses a regularized TextVAE to construct the latent control space, a locality-preserving convolutional encoder to maintain local controllability, and a Transformer controller to model global interactions. Within this latent space, Flow Matching provides a practical trajectory-level approximation to the optimal controller in [Equation 3](https://arxiv.org/html/2605.14531#S3.E3), targeting high data fidelity with low inference cost in [Equation 2](https://arxiv.org/html/2605.14531#S3.E2).

### 4.1 Control-Friendly Manifold Rectification

**Rectification via Diffeomorphism.** We introduce a Variational Autoencoder (VAE) as a coordinate transformation $\psi: \mathcal{D} \to \mathcal{Z}$, regularized by a Gaussian prior via $\mathcal{L}_{\text{KL}}$. This regularization encourages $\mathcal{Z}$ to be **diffeomorphic to Euclidean space**, effectively performing **Manifold Rectification** on the non-metric token space. By constructing a continuous, locally Euclidean latent representation, we obtain a setting where the gradient operator $\nabla_{\mathbf{z}}$ is well-defined and the learned dynamics can be more Lipschitz regular. This yields a **Control-Friendly Geometry**, satisfying the theoretical prerequisites for stable, unique ODE solutions (Bullo & Lewis, [2004](https://arxiv.org/html/2605.14531#bib.bib5)) within our Optimal Control framework.

**Topology-Preserving Compression.** Transformer-based VAEs suffer from global information entanglement, where each latent vector aggregates information from the entire sequence. This destroys the spatial locality required for controllable generation, rendering masked in-painting mathematically ill-posed due to inevitable information leakage. We resolve this by employing Local Integral Operators to enforce strict spatial disentanglement.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x3.png)

Figure 3: **Geometric comparison.** Unlike (a) Autoregressive models’ serial paths or (b-c) Diffusion baselines’ high-curvature trajectories in ill-conditioned spaces, (d) Ours operates on a rectified latent manifold. The learned optimal vector field $v_\theta$ enables energy-minimizing, straight-line transport from noise to data.

### 4.2 Flow Matching as the Lagrangian Solver

Directly solving the PDE in [Section 3.1](https://arxiv.org/html/2605.14531#S3.SS1) is intractable. However, we leverage the duality between the Eulerian (PDE) and Lagrangian (Path) specifications. Conditional Flow Matching (CFM) with Optimal Transport paths parameterizes the target trajectory as $\mathbf{z}_t = (1 - t)\mathbf{z}_0 + t\mathbf{z}_1$. The velocity field of this path is constant: $\mathbf{u}_t(\mathbf{z} \mid \mathbf{z}_1) = \frac{d\mathbf{z}_t}{dt} = \mathbf{z}_1 - \mathbf{z}_0$.

**Approximate solution of HJB:** By the Benamou-Brenier theory (Benamou & Brenier, [2000](https://arxiv.org/html/2605.14531#bib.bib2)), straight-line interpolation minimizes the kinetic energy action $\mathcal{A} = \int \|\mathbf{u}\|^2 dt$ in Wasserstein space. Therefore, the marginal vector field learned by the network, $v_\theta^*(\mathbf{z}, t) = \mathbb{E}[\mathbf{u}_t \mid \mathbf{z}_t = \mathbf{z}]$, provides a tractable approximation to the HJB-induced dynamics (Bertucci, [2023](https://arxiv.org/html/2605.14531#bib.bib3)). Thus, minimizing the simple regression loss:

$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, \mathbf{z}_0, \mathbf{z}_1} \left\| v_\theta(\mathbf{z}_t, t) - (\mathbf{z}_1 - \mathbf{z}_0) \right\|^2, \tag{9}$$

serves as a practical Lagrangian surrogate for the stochastic optimal control problem in rectified latent space. [Figure 2](https://arxiv.org/html/2605.14531#S3.F2)(c) indicates that this solution can be more efficient and accurate than AR and Discrete Diffusion.
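The regression objective in Equation 9 and the resulting ODE sampler can be sketched on scalar latents. Everything concrete here is an assumption for illustration: the prior is a standard normal, the "data" is a point mass at `C`, and `oracle_field` is the closed-form marginal field $(C - z)/(1 - t)$ for that toy case, standing in for a trained network $v_\theta$.

```python
import random

C = 2.0  # assumed toy data distribution: a point mass at C


def sample_prior(rng):
    return rng.gauss(0.0, 1.0)


def sample_data(rng):
    return C


def oracle_field(z, t):
    """Closed-form marginal velocity for the point-mass toy case."""
    return (C - z) / (1.0 - t)


def zero_field(z, t):
    return 0.0


def cfm_loss(v, n=2000, seed=0):
    """Monte Carlo estimate of E |v(z_t, t) - (z1 - z0)|^2 (Equation 9)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t = rng.random()                     # t ~ U[0, 1)
        z0, z1 = sample_prior(rng), sample_data(rng)
        zt = (1.0 - t) * z0 + t * z1         # straight-line path
        total += (v(zt, t) - (z1 - z0)) ** 2
    return total / n


def euler_sample(v, z0, steps):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with explicit Euler."""
    dt = 1.0 / steps
    z = z0
    for k in range(steps):
        z += dt * v(z, k * dt)
    return z


if __name__ == "__main__":
    print(cfm_loss(oracle_field), cfm_loss(zero_field))
    print(euler_sample(oracle_field, z0=-0.7, steps=1))
```

In this toy case the learned paths are exactly straight, so a single Euler step already lands on the data point; with a real multi-modal data distribution the marginal field is only approximately straight and more steps would generally be needed.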

### 4.3 Transformer as the Global Integral Operator

Finally, we analyze the architectural realization of the control law $\mathbf{u}^*$. Since $V(\mathbf{z})$ depends on the global configuration of the latent particles (tokens), the gradient $\nabla_{\mathbf{z}} V$ is a non-local operator. We model the latent state as an interacting particle system. The total force on particle $i$ is given by the integral of pairwise interactions:

𝐮\(i\)\(𝐳\)=−∇𝐳\(i\)V≈∫Ω𝒦\(𝐳\(i\),𝐳\(j\)\)𝐠\(𝐳\(j\)\)𝑑𝐳\(j\)\.\\mathbf\{u\}^\{\(i\)\}\(\\mathbf\{z\}\)=\-\\nabla\_\{\\mathbf\{z\}^\{\(i\)\}\}V\\approx\\int\_\{\\Omega\}\\mathcal\{K\}\(\\mathbf\{z\}^\{\(i\)\},\\mathbf\{z\}^\{\(j\)\}\)\\mathbf\{g\}\(\\mathbf\{z\}^\{\(j\)\}\)d\\mathbf\{z\}^\{\(j\)\}\.\(10\)The Transformer Self\-Attention mechanism is precisely the discrete approximation of this integral operator:

Attention\(𝐳\)i=∑j=1Nexp⁡\(𝐖q𝐳\(i\)\(𝐖k𝐳\(j\)\)⊤\)𝒵⏟Kernel𝒦ij𝐖v𝐳\(j\)⏟Force𝐠j\.\\text\{Attention\}\(\\mathbf\{z\}\)\_\{i\}=\\sum\_\{j=1\}^\{N\}\\underbrace\{\\frac\{\\exp\(\\mathbf\{W\}\_\{q\}\\mathbf\{z\}^\{\(i\)\}\(\\mathbf\{W\}\_\{k\}\\mathbf\{z\}^\{\(j\)\}\)^\{\\top\}\)\}\{\\mathcal\{Z\}\}\}\_\{\\text\{Kernel \}\\mathcal\{K\}\_\{ij\}\}\\underbrace\{\\mathbf\{W\}\_\{v\}\\mathbf\{z\}^\{\(j\)\}\}\_\{\\text\{Force \}\\mathbf\{g\}\_\{j\}\}\.\(11\)Therefore, the Transformer is not an arbitrary choice; it is the structural discretization of the global gradient flow required by the HJB dynamics\.
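The kernel-sum reading of Eq. (11) can be sketched for a single query particle. This is a minimal illustration that assumes identity projections ($\mathbf{W}_q=\mathbf{W}_k=\mathbf{W}_v=\mathbf{I}$) and, like Eq. (11) as written, omits the usual $1/\sqrt{d}$ scaling; `attention_row` is a hypothetical helper name.

```python
import math

def attention_row(q_i, keys, values):
    """Softmax-weighted sum over j: sum_j K_ij * g_j for one query particle i."""
    scores = [sum(a * b for a, b in zip(q_i, k)) for k in keys]
    m = max(scores)                               # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z_norm = sum(weights)                         # normaliser Z
    kernel = [w / z_norm for w in weights]        # K_ij, sums to 1 over j
    dim = len(values[0])
    out = [sum(kernel[j] * values[j][d] for j in range(len(values)))
           for d in range(dim)]
    return kernel, out

# Three toy latent particles; projections are the identity (assumption for brevity).
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel, force = attention_row(z[0], z, z)
```

The normalised weights play the role of the kernel $\mathcal{K}_{ij}$ and the value vectors play the role of the force terms $\mathbf{g}_j$, so the output is a quadrature-style sum over all particles.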

Table 1: Theoretical comparison of generative dynamics.

## 5 Experiments

### 5.1 Experimental Setup

Zero-shot Language Modeling. To rigorously evaluate density estimation, the primary benchmark where Discrete DLMs seek to challenge AR models, we strictly adhere to the protocols established by recent discrete diffusion studies (Lou et al., [2023](https://arxiv.org/html/2605.14531#bib.bib28); Ou et al., [2024](https://arxiv.org/html/2605.14531#bib.bib34)). Specifically, models are trained on OpenWebText (Gokaslan & Cohen, [2019](https://arxiv.org/html/2605.14531#bib.bib14)) following SEDD's processing (Lou et al., [2023](https://arxiv.org/html/2605.14531#bib.bib28)) and evaluated on five benchmarks: LAMBADA (Paperno et al., [2016](https://arxiv.org/html/2605.14531#bib.bib35)), WikiText-2/103 (Merity et al., [2016](https://arxiv.org/html/2605.14531#bib.bib30)), PTB, and 1BW (Radford et al., [2019](https://arxiv.org/html/2605.14531#bib.bib36)).

Conditional Text Generation. Following the evaluation pipeline of DiffuSeq (Gong et al., [2023](https://arxiv.org/html/2605.14531#bib.bib16)), models are evaluated on four representative conditional text generation tasks: *Paraphrase Generation* on QQP (DataCanary et al., [2017](https://arxiv.org/html/2605.14531#bib.bib10)); *Question Generation* on Quasar-T (Dhingra et al., [2017](https://arxiv.org/html/2605.14531#bib.bib11)); *Text Simplification* on Wiki-Auto (Jiang et al., [2020](https://arxiv.org/html/2605.14531#bib.bib22)); and *Open-domain Dialogue* on CCD (Zhou et al., [2018](https://arxiv.org/html/2605.14531#bib.bib47)).

Evaluation Metrics. Our evaluation focuses on both quality and diversity. Quantitatively, we measure generation quality via BLEU, ROUGE-L (R-L), and BERTScore (Score), while diversity is assessed using Distinct-1 (D-1). For zero-shot density estimation, we report Perplexity (PPL). For unconditional free generation, we additionally report generation perplexity (Gen PPL) together with unigram entropy.
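The two diversity measures named above have simple standard definitions, sketched below. The sketch assumes whitespace tokenization and per-sample computation; the paper does not state its tokenizer or whether these metrics are aggregated over a corpus, so treat those choices as illustrative.

```python
import math
from collections import Counter

def distinct_1(tokens):
    """Distinct-1: ratio of unique unigrams to total unigrams."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def unigram_entropy(tokens):
    """Shannon entropy (in nats) of the empirical unigram distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

sample = "the cat sat on the mat".split()   # whitespace tokens (assumption)
d1 = distinct_1(sample)                     # 5 unique tokens out of 6
h = unigram_entropy(sample)
```

Higher values of either quantity indicate less repetitive output, which is why they are reported alongside quality metrics rather than in place of them.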

### 5.2 Main Results

##### Zero-shot Language Modeling.

As shown in [Table 2](https://arxiv.org/html/2605.14531#S5.T2), our model consistently outperforms baselines, establishing a new state-of-the-art over discrete diffusion and AR models. On the large 1BW dataset, we achieve a perplexity of 62.55, improving over the strong diffusion baseline MD4 (68.10) and the autoregressive GPT-2 baseline (75.2). Substantial reductions on LAMBADA (28%) and WikiText-2 (12%) further suggest that our latent optimal control architecture captures complex dependencies better than prior discrete formulations, effectively bridging the gap to autoregressive density estimation.

Table 2: ELBO-based zero-shot perplexity ($\downarrow$). Diffusion-based likelihoods are evaluated with the upper-bound/proxy protocol used in prior diffusion-LM work. The Manta-LM result uses the 101M model.

For [Table 2](https://arxiv.org/html/2605.14531#S5.T2), the reported diffusion-model PPL is an ELBO-based density-estimation proxy rather than direct free-generation likelihood. For Manta-LM, the bound decomposes into an encoder posterior term, a flow-prior density term computed by change of variables along the learned ODE, and a decoder likelihood term:

$$\log p_{\theta}(x)\geq\mathbb{E}_{q_{\phi}(\mathbf{z}\mid x)}\big[\log p_{\theta}(x\mid\mathbf{z})+\log p_{\theta}(\mathbf{z})-\log q_{\phi}(\mathbf{z}\mid x)\big]. \tag{12}$$

We compute $\mathrm{PPL}=\exp(-\mathcal{L}_{\mathrm{ELBO}}/N)$ over $N$ tokens, following the same zero-shot evaluation protocol as discrete diffusion baselines.
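The arithmetic behind the ELBO-based perplexity can be made concrete with a short sketch. The three log-probability values below are made-up numbers in nats for a hypothetical 10-token sequence, and the function names are illustrative. In practice the flow-prior term would come from a change-of-variables computation along the learned ODE, which is not reproduced here.

```python
import math

def elbo_per_sample(log_px_given_z, log_prior_z, log_q_z_given_x):
    """Single-sample ELBO estimate: log p(x|z) + log p(z) - log q(z|x)."""
    return log_px_given_z + log_prior_z - log_q_z_given_x

def perplexity_from_elbo(total_elbo, num_tokens):
    """PPL = exp(-ELBO / N); an upper bound on true perplexity since ELBO <= log p(x)."""
    return math.exp(-total_elbo / num_tokens)

# Hypothetical values (nats) for illustration only.
elbo = elbo_per_sample(log_px_given_z=-30.0, log_prior_z=-12.0, log_q_z_given_x=-2.0)
ppl = perplexity_from_elbo(elbo, num_tokens=10)
```

Because the ELBO lower-bounds $\log p_\theta(x)$, the resulting perplexity upper-bounds the model's true perplexity, which is why the table caption describes it as a proxy.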

Table 3: Unconditional generation quality and diversity. Gen PPL ($\downarrow$) measures free-generation fidelity, while unigram entropy ($\uparrow$) monitors diversity.

Table 4: Method comparison. ∙ indicates AR models, ⋆ indicates non-autoregressive models, and ‡ indicates non-autoregressive diffusion models. Boldfaced results show the best across all models; underlined results show the best across all non-AR models.
##### Conditional Text Generation\.

[Table 4](https://arxiv.org/html/2605.14531#S5.T4) summarizes our performance against AR and non-AR baselines. *1) Quality & AR-Parity:* Manta-LM sets a new non-autoregressive SOTA, surpassing SeqDiffuSeq by over 5 BLEU on Paraphrase Generation. Crucially, we match or even surpass fine-tuned GPT2-large on Text Simplification and Question Generation, demonstrating ARM-level fidelity with non-AR inference benefits. *2) Diversity:* Our model maintains high Distinct-1 scores comparable to DiffuSeq. In high-entropy Open-domain Dialogue, we achieve the highest BLEU (1.95) among the compared methods, suggesting robustness in modeling complex one-to-many mappings.

Table 5: Reconstruction and robustness of TextVAE. TextVAE achieves a bidirectional mapping between the token space and the latent control space. Reconstruction results demonstrate that it performs this mapping with high precision, and lower Lipschitz constants indicate strong robustness in the latent control space, allowing for a certain margin of error in the controller.

Table 6: Self-correction performance. Natural Language Inference (NLI) measures the semantic consistency between the generated text and the ground-truth text. Reward Model (RM) evaluates the quality of correction using an external learned reward model (ArmoRM-Llama3-8B-v0.1). BERTScore (Score) evaluates semantic similarity, and Levenshtein Distance (Lev.) measures the average edit distance to the ground-truth text.

Table 7: Text infilling quality. MAUVE measures the similarity between model-generated infillings and reference texts, given prefix–suffix pairs. Higher values indicate better alignment with human-like text distributions.

### 5.3 Evaluation on Global Closed-Loop Generation

While zero-shot perplexity validates density estimation as an Initial Value Problem (IVP), it offers limited insight into global feedback mechanisms. To analyze the Closed-Loop Generation performance of our framework, we extend our evaluation to Boundary Value Problems (BVPs), formulating generation as a trajectory optimization task under constraints. Such tasks are challenging for standard causal AR models due to their unidirectional visibility and for discrete diffusion due to the lack of smooth gradient guidance, yet they naturally fit the operating mode of our optimal control policy.

Constrained Self-Correction. We first evaluate *error correction*, a task requiring the model to act as a closed-loop controller that projects off-manifold (noisy or hallucinated) states back onto the clean data manifold. We introduce multi-granular perturbations to the WikiText-2 and OpenWebText datasets: (1) Lexical Noise (homoglyph/deletion), (2) Semantic Swaps (entity substitution), and (3) Logical Permutations (temporal reordering). Standard AR models, constrained by unidirectional attention, treat errors as context to be continued rather than states to be corrected. In contrast, our model utilizes the learned vector field as a *restoring force* to minimize transport energy toward the valid distribution.

In Table [6](https://arxiv.org/html/2605.14531#S5.T6), we report NLI consistency and Reward Model (RM) scores to measure semantic validity, alongside Levenshtein Distance to penalize unrestricted rewriting. As shown in Table [6](https://arxiv.org/html/2605.14531#S5.T6), our Manta-LM achieves a favorable balance between high semantic fidelity (0.835 BERTScore) and low edit distance (169.1). While baselines like Qwen2.5-Instruct (Yang et al., [2024](https://arxiv.org/html/2605.14531#bib.bib44)) achieve high NLI scores by rewriting the entire sentence (high Levenshtein), our model performs surgical corrections. This empirically confirms our theoretical claim: the optimal controller is robust in latent control space, effectively self-correcting the error states while preserving the uncorrupted global structure.
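To make the role of Levenshtein Distance concrete, the following sketch computes the standard edit distance and contrasts a minimal correction with a full rewrite. The example sentences are invented, and the character-level granularity is an assumption; the text does not state whether the reported distance is over characters or tokens.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]

# Invented example strings, character-level comparison (assumption).
reference = "the treaty was signed in 1648"
corrupted = "the treaty was singed in 1648"
surgical = "the treaty was signed in 1648"
rewrite = "in 1648 both parties signed a treaty"

d_surgical = levenshtein(surgical, reference)
d_rewrite = levenshtein(rewrite, reference)
```

A correction that touches only the corrupted span stays close to the reference, whereas a paraphrase with the same meaning accumulates a large edit distance, which is the behavior the metric is meant to penalize.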

Generative Interpolation (i.e., Long-Form Infilling). We further task the model with *long-form infilling*, bridging two distant boundary conditions, a capability requiring long-horizon look-ahead. Using OpenWebText, we construct a challenging benchmark with an aggressive 1:8:1 split (masking the central 80% of text), stratified across sequence lengths up to 1024 tokens. This task is challenging for standard ARMs: without specialized Fill-In-the-Middle (FIM) pre-training, causal models do not directly condition on the suffix (future boundary), which can result in incoherent bridges.
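The 1:8:1 split can be sketched as follows. This is an illustrative construction that assumes integer token ids and floor rounding of the boundary segments; the benchmark's exact rounding and stratification rules are not specified in the text.

```python
def split_1_8_1(tokens):
    """Split a token list into prefix, masked middle, suffix with ratio 1:8:1."""
    n = len(tokens)
    cut = n // 10                      # each boundary segment takes ~10% of the tokens
    prefix = tokens[:cut]
    middle = tokens[cut:n - cut]       # ~80% that the model must infill
    suffix = tokens[n - cut:]
    return prefix, middle, suffix

tokens = list(range(1024))             # stand-in for 1024 token ids
prefix, middle, suffix = split_1_8_1(tokens)
```

For a 1024-token sequence this leaves roughly one hundred tokens of context on each side, which shows how little conditioning information the model receives relative to the span it must generate.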

In Table[7](https://arxiv.org/html/2605.14531#S5.T7), we prioritize distributional alignment and coherence using the MAUVE score\. Our method demonstrates exceptional infilling performance, successfully generating coherent transitions that respect both prefix and suffix constraints\. This validates that our model solves the*Two\-Point Boundary Value Problem*by relaxing the latent trajectory between fixed endpoints\. By leveraging the continuous topology of the latent space, our approach avoids the discontinuity issues of discrete infilling, establishing a new paradigm for controllable long\-context generation\.

### 5.4 Model Evaluation and Analysis

Efficiency Analysis. Figure [4](https://arxiv.org/html/2605.14531#S5.F4) demonstrates the dual efficiency gains of our framework. By leveraging $4\times$ latent compression and an energy-minimizing straight trajectory, our model achieves nearly $4\times$ higher throughput per step while requiring significantly fewer sampling steps (NFE) than baselines. Crucially, our inference cost is decoupled from sequence length ($O(1)$ steps), breaking the linear dependency ($O(N)$) inherent to AR models. For 1024-token generation, this culminates in a $4.6\times$ speedup over the most efficient baseline, validating the practical scalability of our optimal control formulation.
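Why a straight trajectory needs few function evaluations can be seen in a small sketch of fixed-step Euler integration. It assumes an idealized, exactly constant velocity field between two hypothetical endpoints; a learned field is only approximately straight, so this illustrates the best case rather than the paper's sampler.

```python
def euler_sample(v, z0, num_steps):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with a fixed number of Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        vel = v(z, t)                  # one function evaluation (NFE += 1)
        z = [zi + dt * vi for zi, vi in zip(z, vel)]
    return z

z0 = [0.5, -1.0]                       # hypothetical start latent
z1 = [2.0, 3.0]                        # hypothetical target latent
straight = lambda z, t: [b - a for a, b in zip(z0, z1)]   # constant velocity z1 - z0

one_step = euler_sample(straight, z0, num_steps=1)
eight_steps = euler_sample(straight, z0, num_steps=8)
```

With a constant velocity, one Euler step already lands on the target, so additional steps buy nothing. The step count is also independent of how many tokens the latent encodes, which is the sense in which cost is decoupled from sequence length.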

![Refer to caption](https://arxiv.org/html/2605.14531v1/x4.png)

Figure 4: Efficiency evaluation with inference throughput.

Geometric Analysis: Reduce Stiffness. To validate the necessity of the latent space and the theoretical claims made in [Section 3](https://arxiv.org/html/2605.14531#S3), we compare the transport dynamics in the raw embedding space versus our rectified latent manifold using two metrics: (1) *Trajectory Curvature* $\kappa(t)\approx\|\ddot{\mathbf{z}}_{t}\|/\|\dot{\mathbf{z}}_{t}\|^{2}$, which quantifies the deviation from the optimal straight-line transport; and (2) *Vector Field Stiffness* $\mathcal{S}(\mathbf{z})=\|\nabla_{\mathbf{z}}v_{t}(\mathbf{z})\|_{F}$, which estimates the local Lipschitz constant governing numerical stability.
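Both metrics can be estimated with finite differences, as in the sketch below. It assumes uniformly sampled trajectories and a time-independent toy vector field; the function names and test trajectories are illustrative and not taken from the paper's evaluation code.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def curvature(traj, dt):
    """kappa(t) ~ ||z''|| / ||z'||^2 at interior points, via central differences."""
    out = []
    for k in range(1, len(traj) - 1):
        vel = [(c - a) / (2 * dt) for a, c in zip(traj[k - 1], traj[k + 1])]
        acc = [(a - 2 * b + c) / dt ** 2
               for a, b, c in zip(traj[k - 1], traj[k], traj[k + 1])]
        out.append(norm(acc) / norm(vel) ** 2)
    return out

def stiffness(v, z, eps=1e-5):
    """Frobenius norm of the Jacobian of v at z, via central finite differences."""
    total = 0.0
    for j in range(len(z)):
        zp = list(z); zp[j] += eps
        zm = list(z); zm[j] -= eps
        col = [(p - m) / (2 * eps) for p, m in zip(v(zp), v(zm))]
        total += sum(c * c for c in col)
    return math.sqrt(total)

dt = 0.1
line = [[k * dt, 2 * k * dt] for k in range(11)]                   # straight path
arc = [[math.cos(k * dt), math.sin(k * dt)] for k in range(11)]    # unit circle, unit speed

kappa_line = max(curvature(line, dt))
kappa_arc = curvature(arc, dt)[0]
s_linear = stiffness(lambda z: [3.0 * z[0], -4.0 * z[1]], [0.2, 0.7])
```

The straight path has curvature near zero, the unit circle has curvature near one, and the linear field with Jacobian diag(3, -4) has Frobenius norm 5, which gives a quick sanity check before applying the estimators to model trajectories.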

![Refer to caption](https://arxiv.org/html/2605.14531v1/x5.png)

Figure 5: Stiffness Analysis. The raw Token (Embedding) Space exhibits extreme stiffness and high curvature, indicating an ill-conditioned control landscape that forces adaptive solvers (RK45) to high NFE. In contrast, our Rectified Latent Space maintains low stiffness and near-linear trajectories, verifying the efficacy of the VAE.

Ill-Conditioned Embedding vs. Rectified Latent. As shown in Figure [5](https://arxiv.org/html/2605.14531#S5.F5), the raw embedding space exhibits extreme stiffness and curvature. This geometric pathology indicates a non-convex landscape where the vector field must undergo drastic directional changes. According to the Picard-Lindelöf theorem, such exploding Lipschitz constants ($\mathcal{S}\to\infty$) render the ODE stiff, forcing adaptive solvers to take infinitesimally small steps (high NFE) to maintain stability. In contrast, our latent model maintains low stiffness and near-linear trajectories. This empirically confirms that the VAE performs *Manifold Rectification*, transforming the ill-conditioned high-frequency regression problem into a well-conditioned one, thereby making the HJB-inspired dynamics easier to approximate with Flow Matching and efficient large-step integration.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x6.png)

Figure 6: Optimization landscapes of different generation paradigms. (a) AR exhibits sharp and unstable geometry. (b) Discrete diffusion leads to fragmented and irregular landscapes. (c) Our Manta-LM yields a smooth and well-conditioned landscape, enabling stable optimization.

Geometric Regularity and Optimization Stability. [Figure 6](https://arxiv.org/html/2605.14531#S5.F6) contrasts the rugged optimization landscape of discrete baselines, which reflects severe ill-conditioning, with the smooth and wide-valley geometry induced by our Manta-LM. This topological regularity empirically supports the effect of manifold rectification. By transforming a chaotic combinatorial search process into a well-conditioned optimal control problem, our method yields intrinsically stable dynamics and demonstrates clear advantages over brittle discrete formulations.

## 6 Conclusion

We presented Manta-LM, a framework that studies and re-imagines text generation as a Stochastic Optimal Control problem. By approximating HJB-inspired dynamics with Flow Matching on a rectified manifold, our model overcomes the "myopic" limitations of autoregressive baselines, enabling global trajectory planning and closed-loop feedback-based refinement. Empirical results show strong zero-shot density estimation and favorable performance on Boundary Value Problems, such as non-causal infilling and error correction, that are difficult for standard causal models without specialized mechanisms. This work connects discrete language modeling with continuous dynamical systems and offers a robust, mathematically grounded path toward efficient and controllable text generation.

##### Potential Limitations and Future Directions

While the proposed global planning paradigm enables stable and well\-guided generation, it may be computationally less efficient for interruptible agentic tasks, such as tool calling, compared to fine\-grained autoregressive inference\. Adapting continuous flow dynamics to support low\-cost, sequential interruptions without incurring redundant computation remains an important direction for future work\.

## Impact Statement

This paper presents work whose goal is to advance the field of machine learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## References

- Austin et al. (2021) Austin, J., Johnson, D. D., Ho, J., Tarlow, M., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In *Advances in Neural Information Processing Systems*, volume 34, pp. 17981–17993, 2021.
- Benamou & Brenier (2000) Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. *Numerische Mathematik*, 84(3):375–393, 2000.
- Bertucci (2023) Bertucci, C. Stochastic optimal transport and Hamilton–Jacobi–Bellman equations on the set of probability measures. *Annales de l’Institut Henri Poincaré C, Analyse non linéaire*, 2023. URL [https://api.semanticscholar.org/CorpusID:259095954](https://api.semanticscholar.org/CorpusID:259095954).
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In *Advances in Neural Information Processing Systems*, volume 33, pp. 1877–1901, 2020.
- Bullo & Lewis (2004) Bullo, F. and Lewis, A. D. Geometric control of mechanical systems. 2004. URL [https://api.semanticscholar.org/CorpusID:679624](https://api.semanticscholar.org/CorpusID:679624).
- Chelba et al. (2013) Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P. T., and Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. In *Interspeech*, 2013.
- Chen et al. (2024) Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., and Han, S. Deep compression autoencoder for efficient high-resolution diffusion models. *arXiv preprint arXiv:2410.10733*, 2024.
- Cheng et al. (2024) Cheng, C., Li, J., Peng, J., and Liu, G. Categorical flow matching on statistical manifolds. *Advances in Neural Information Processing Systems*, 37:54787–54819, 2024.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*, 2021.
- DataCanary et al. (2017) DataCanary, hilfialkaff, Jiang, L., Risdal, M., Dandekar, N., and tomtung. Quora question pairs. Kaggle Competition, 2017. [https://kaggle.com/competitions/quora-question-pairs](https://kaggle.com/competitions/quora-question-pairs).
- Dhingra et al. (2017) Dhingra, B., Mazaitis, K., and Cohen, W. W. Quasar: Datasets for question answering by search and reading. *arXiv preprint arXiv:1707.03904*, 2017.
- Dieleman et al. (2022) Dieleman, S., Sartran, L., Roshannai, A., Savinov, N., Ganin, Y., Richemond, P. H., Doucet, A., Strudel, R., Dyer, C., Durkan, C., et al. Continuous diffusion for categorical data. *arXiv preprint arXiv:2211.15089*, 2022.
- Fleming & Rishel (2012) Fleming, W. H. and Rishel, R. W. *Deterministic and stochastic optimal control*. Springer Science & Business Media, 2012.
- Gokaslan & Cohen (2019) Gokaslan, A. and Cohen, V. Openwebtext corpus. [http://Skylion007.github.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus), 2019.
- (15) Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. Diffuseq-v2: Bridging discrete and continuous text spaces for accelerated seq2seq diffusion models. In *The 2023 Conference on Empirical Methods in Natural Language Processing*.
- Gong et al. (2023) Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. Diffuseq: Sequence to sequence text generation with diffusion models. In *International Conference on Learning Representations (ICLR 2023) (01/05/2023-05/05/2023, Kigali, Rwanda)*, 2023.
- Gong et al. (2024) Gong, S., Agarwal, S., Zhang, Y., Ye, J., Zheng, L., Li, M., An, C., Zhao, P., Bi, W., Han, J., et al. Scaling diffusion language models via adaptation from autoregressive models. *arXiv preprint arXiv:2410.17891*, 2024.
- Gulrajani & Hashimoto (2023) Gulrajani, I. and Hashimoto, T. B. Likelihood-based diffusion language models. *Advances in Neural Information Processing Systems*, 36:16693–16715, 2023.
- Han et al. (2023) Han, X., Kumar, S., and Tsvetkov, Y. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pp. 11575–11596, 2023.
- Hoogeboom et al. (2021) Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. *Advances in Neural Information Processing Systems*, 34:12454–12465, 2021.
- Huang et al. (2025) Huang, S., Cheng, T., Liu, J. K., Xu, W., Hao, J., Song, L., Xu, Y., Yang, J., Liu, J., Zhang, C., et al. Opencoder: The open cookbook for top-tier code large language models. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pp. 33167–33193, 2025.
- Jiang et al. (2020) Jiang, C., Maddela, M., Lan, W., Zhong, Y., and Xu, W. Neural CRF model for sentence alignment in text simplification. *arXiv preprint arXiv:2005.02324*, 2020.
- Jo & Hwang (2025) Jo, J. and Hwang, S. J. Continuous diffusion model for language modeling. In *Neural Information Processing Systems*, 2025.
- Li et al. (2025a) Li, J., Du, L., Zhao, H., Zhang, B.-w., Wang, L., Gao, B., Liu, G., and Lin, Y. Infinity instruct: Scaling instruction selection and synthesis to enhance language models. *arXiv preprint arXiv:2506.11116*, 2025a.
- Li et al. (2025b) Li, S., Gu, J., Liu, K., Lin, Z., Wei, Z., Grover, A., and Kuen, J. Lavida-o: Elastic large masked diffusion models for unified multimodal understanding and generation. *arXiv preprint arXiv:2509.19244*, 2025b.
- Li et al. (2022) Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-lm improves controllable text generation. *Advances in Neural Information Processing Systems*, 35:4328–4343, 2022.
- Lipman et al. (2023) Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In *International Conference on Learning Representations*, 2023.
- Lou et al. (2023) Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. *arXiv preprint arXiv:2310.16834*, 2023.
- Mahabadi et al. (2024) Mahabadi, R. K., Ivison, H., Tae, J., Henderson, J., Beltagy, I., Peters, M. E., and Cohan, A. Tess: Text-to-text self-conditioned simplex diffusion. In *Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)*, pp. 2347–2361, 2024.
- Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.
- Moshkov et al. (2025) Moshkov, I., Hanley, D., Sorokin, I., Toshniwal, S., Henkel, C., Schifferer, B., Du, W., and Gitman, I. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. *arXiv preprint arXiv:2504.16891*, 2025.
- Nie et al. (2025a) Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. In *Neural Information Processing Systems*, 2025a.
- Nie et al. (2025b) Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. *arXiv preprint arXiv:2502.09992*, 2025b.
- Ou et al. (2024) Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. *arXiv preprint arXiv:2406.03736*, 2024.
- Paperno et al. (2016) Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pp. 1525–1534, 2016.
- Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. *OpenAI blog*, 1(8):9, 2019.
- Sahoo et al. (2024) Sahoo, S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J., Rush, A., and Kuleshov, V. Simple and effective masked diffusion language models. *Advances in Neural Information Processing Systems*, 37:130136–130184, 2024.
- Shi et al. (2024) Shi, J., Han, K., Wang, Z., Doucet, A., and Titsias, M. Simplified and generalized masked diffusion for discrete data. *Advances in Neural Information Processing Systems*, 37:103131–103167, 2024.
- Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In *International Conference on Machine Learning*, pp. 2256–2265. PMLR, 2015.
- Strudel et al. (2022) Strudel, R., Tallec, C., Altché, F., Du, Y., Ganin, Y., Mensch, A., Grathwohl, W., Savinov, N., Dieleman, S., Sifre, L., et al. Self-conditioned embedding diffusion for text generation. *arXiv preprint arXiv:2211.04236*, 2022.
- Sun et al. (2022) Sun, H., Yu, L., Dai, B., Schuurmans, D., and Dai, H. Score-based continuous-time discrete diffusion models. *arXiv preprint arXiv:2211.16750*, 2022.
- Swerdlow et al. (2025) Swerdlow, A., Prabhudesai, M., Gandhi, S., Pathak, D., and Fragkiadaki, K. Unified multimodal discrete diffusion. *arXiv preprint arXiv:2503.20853*, 2025.
- Touvron et al. (2023) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. *arXiv preprint arXiv:2302.13971*, 2023.
- Yang et al. (2024) Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K.-Y., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Men, R., Gao, R., Lin, R., Wang, S., Bai, S., Tan, S., Zhu, T., Li, T., Liu, T., Ge, W., Deng, X., Zhou, X., Ren, X., Zhang, X., Wei, X., Ren, X., Fan, Y., Yao, Y., Zhang, Y., Wan, Y., Chu, Y., Cui, Z., Zhang, Z., and Fan, Z.-W. Qwen2 technical report. *ArXiv*, abs/2407.10671, 2024.
- Yang et al. (2025) Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., and Wang, M. Mmada: Multimodal large diffusion language models. *arXiv preprint arXiv:2505.15809*, 2025.
- Ye et al. (2025) Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7b: Diffusion large language models. *arXiv preprint arXiv:2508.15487*, 2025.
- Zhou et al. (2018) Zhou, H., Young, T., Huang, M., Zhao, H., Xu, J., and Zhu, X. Commonsense knowledge aware conversation generation with graph attention. In *IJCAI*, volume 18, pp. 4623–4629, 2018.

## Appendix A

### A.1 Find the Optimal Control Law via Solving HJB

We formally derive the optimal control law that governs our Latent Flow LLM. Following the Benamou-Brenier formulation for dynamic optimal transport (Benamou & Brenier, [2000](https://arxiv.org/html/2605.14531#bib.bib2)), we seek a control policy $\mathbf{u}^{*}$ that minimizes the transport cost functional $J(\mathbf{u})$. In the specific context of text generation, the Terminal Cost $\Phi$ is strictly defined by the data likelihood:

$$\Phi(\mathbf{z}_{1})=-\log p_{\text{data}}(\mathbf{z}_{1}). \tag{13}$$

Thus, the total objective functional becomes:

$$J(\mathbf{u})=\mathbb{E}\left[\underbrace{-\log p_{\text{data}}(\mathcal{D}(\mathbf{z}_{1}))}_{\text{Semantic Fidelity}}+\lambda\int_{0}^{1}\underbrace{\frac{1}{2}\|\mathbf{u}_{t}(\mathbf{z}_{t})\|^{2}}_{\text{Transport Energy}}dt\right]. \tag{14}$$
The Hamilton-Jacobi-Bellman (HJB) Equation. To solve for the optimal trajectory, we define the Value Function $V(\mathbf{z},t)$ as the minimum expected cost-to-go from state $\mathbf{z}$ at time $t$:

$$V(\mathbf{z},t)=\inf_{\mathbf{u}}\mathbb{E}\left[\Phi(\mathbf{z}_{1})+\int_{t}^{1}\frac{1}{2}\|\mathbf{u}\|^{2}d\tau\,\bigg|\,\mathbf{z}_{t}=\mathbf{z}\right]. \tag{15}$$

According to the Dynamic Programming Principle in continuous time, the value function $V$ must satisfy the Hamilton-Jacobi-Bellman (HJB) partial differential equation:

$$-\frac{\partial V}{\partial t}=\inf_{\mathbf{u}}\mathcal{H}(\mathbf{z},\mathbf{u},\nabla_{\mathbf{z}}V), \tag{16}$$

where the Hamiltonian $\mathcal{H}$ represents the total energy of the system:

$$\mathcal{H}(\mathbf{z},\mathbf{u},\nabla_{\mathbf{z}}V)=\underbrace{\frac{1}{2}\|\mathbf{u}\|^{2}}_{\text{Kinetic}}+\underbrace{\mathbf{u}\cdot\nabla_{\mathbf{z}}V}_{\text{Potential Interaction}}. \tag{17}$$

The Optimal Control Law. The optimal control $\mathbf{u}^{*}$ is obtained by minimizing the Hamiltonian with respect to $\mathbf{u}$. Setting the gradient $\nabla_{\mathbf{u}}\mathcal{H}=0$, we derive the analytical form of the optimal controller:

$$\mathbf{u}+\nabla_{\mathbf{z}}V=0\implies\mathbf{u}^{*}(\mathbf{z},t)=-\nabla_{\mathbf{z}}V(\mathbf{z},t). \tag{18}$$
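As a consistency check on Eqs. (15)–(18), the minimizer can be substituted back into the Hamiltonian. The step below is a worked calculation under the deterministic form of Eq. (17) as written (no diffusion term); it is offered as an illustration and is not an additional result stated by the paper. The terminal condition follows from evaluating Eq. (15) at $t=1$.

$$
\begin{aligned}
\inf_{\mathbf{u}}\mathcal{H}(\mathbf{z},\mathbf{u},\nabla_{\mathbf{z}}V)
&=\tfrac{1}{2}\|\nabla_{\mathbf{z}}V\|^{2}-\|\nabla_{\mathbf{z}}V\|^{2}
=-\tfrac{1}{2}\|\nabla_{\mathbf{z}}V\|^{2},\\
-\frac{\partial V}{\partial t}=-\tfrac{1}{2}\|\nabla_{\mathbf{z}}V\|^{2}
&\iff
\frac{\partial V}{\partial t}=\tfrac{1}{2}\|\nabla_{\mathbf{z}}V\|^{2},
\qquad V(\mathbf{z},1)=\Phi(\mathbf{z}).
\end{aligned}
$$

The resulting equation is nonlinear in $\nabla_{\mathbf{z}}V$ and must be solved backward from the terminal condition, which illustrates why a direct PDE solve is impractical in a high-dimensional latent space.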
Manifestation in Text Generation: The Semantic Gradient Flow. Equation ([18](https://arxiv.org/html/2605.14531#A1.E18)) reveals the physical essence of our generation process. In the context of LLMs, the value function $V(\mathbf{z},t)$ acts as a time-varying Semantic Potential Field:

- Potential Landscape: High values of $V$ correspond to regions of low semantic coherence or syntactic errors; low values correspond to the data manifold (valid text).
- Gradient Guidance: The optimal control law $\mathbf{u}^{*}=-\nabla_{\mathbf{z}}V$ dictates that the latent state should strictly follow the direction of steepest semantic descent.

Consequently, our Manta-LM functions as an Optimal Closed-Loop Controller. Unlike AR models that predict the next token from history alone, our controller uses the global gradient of the semantic potential field ($-\nabla V$), steering the flow to actively minimize the discrepancy between the current latent state and valid target text.

### A.2 Language Models as Sub-optimal Controller

Under the continuous-control view, the departures of AR and Discrete Diffusion from the optimal control law $\mathbf{u}^{*}=-\nabla V$ are not merely architectural choices but fundamental violations of the well-posedness conditions for dynamical systems. We analyze three resulting issues: Energy Divergence, Adjoint State Vanishing, and Gradient Absence.

##### 1. Trajectory Singularity and System Stiffness.

Optimal transport mandates minimizing the kinetic energy action $\mathcal{A}=\frac{1}{2}\int_{0}^{1}\|\mathbf{u}_{t}\|^{2}dt$.

- Impulsive AR dynamics: For Autoregressive models, substituting the impulsive control law (Eq. [4](https://arxiv.org/html/2605.14531#S3.E4)) into the action functional yields a divergence:

  $$\mathcal{A}_{\text{AR}}\propto\int_{0}^{1}\left\|\sum\mathbf{f}_{k}\delta(t-t_{k})\right\|^{2}dt\to\infty. \tag{19}$$

  Since the $L^{2}$ norm of a Dirac delta is undefined (infinite energy), the trajectory exhibits Infinite Instantaneous Curvature. In numerical analysis, this characterizes a highly stiff system ($\lambda_{\max}\to\infty$), making it poorly suited to parallel ODE solvers and reflecting the serial nature of AR integration.
- Lipschitz Explosion: For continuous diffusion on raw embeddings, the irregular topology implies that the gradient field $\nabla\log p_{t}$ is not Lipschitz continuous. The local Lipschitz constant $L$ explodes:

  $$L=\sup_{\mathbf{z}\neq\mathbf{y}}\frac{\|\mathbf{u}(\mathbf{z})-\mathbf{u}(\mathbf{y})\|}{\|\mathbf{z}-\mathbf{y}\|}\gg 1. \tag{20}$$

  Stability of numerical integration requires step sizes $\Delta t<2/L$. As $L\to\infty$, $\Delta t\to 0$, causing the Number of Function Evaluations (NFE) to diverge as shown in [Figure 5](https://arxiv.org/html/2605.14531#S5.F5).

This stiffness helps explain the serial or high-NFE behavior observed in these settings: highly non-smooth (singular) dynamics are difficult for parallel continuous solvers (such as Picard iteration) to integrate efficiently.
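The step-size bound $\Delta t < 2/L$ can be checked on the standard linear test problem $\dot{z} = -Lz$, where explicit Euler multiplies the state by $(1 - L\,\Delta t)$ at every step. The sketch below is a toy check under that assumption, with a hypothetical helper `euler_final_magnitude`; it is not a measurement of any language model.

```python
def euler_final_magnitude(lipschitz, dt, n_steps=200, z0=1.0):
    """Explicit Euler on dz/dt = -L * z; each step scales z by (1 - L * dt)."""
    z = z0
    for _ in range(n_steps):
        z = z + dt * (-lipschitz * z)
    return abs(z)


L = 100.0
stable = euler_final_magnitude(L, dt=0.9 * 2 / L)    # step just below the 2/L bound
unstable = euler_final_magnitude(L, dt=1.1 * 2 / L)  # step just above the 2/L bound
print(stable, unstable)  # the first decays toward 0, the second blows up
```

Under this bound, covering the unit time interval takes more than $L/2$ Euler steps, so the step count grows linearly with the Lipschitz constant.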

##### 2. Open-Loop Drift via Adjoint State Vanishing.

In control theory, the optimality of a decision at time $t$ is determined by the *Prediction Horizon* $H$. A general Model Predictive Control (MPC) objective minimizes the cost-to-go over the interval $[t, t+H]$:

$$J_{H}(\mathbf{u}_{t}) = \mathbb{E}\left[\int_{t}^{t+H} \frac{1}{2}\|\mathbf{u}_{\tau}\|^{2}\, d\tau + \Phi(\mathbf{z}_{t+H})\right], \tag{21}$$

where $\Phi(\mathbf{z}_{t+H})$ denotes the *Terminal Potential*, which encodes the global semantic coherence and validity of the final outcome. In optimal control, trajectory stability is governed by the *Adjoint Equation* for the co-state $\mathbf{p}_{t}$ (the sensitivity of the terminal cost $\Phi$):

$$\dot{\mathbf{p}}_{t} = -\nabla_{\mathbf{z}} \mathcal{H}(\mathbf{z}_{t}, \mathbf{u}_{t}, \mathbf{p}_{t}), \quad \mathbf{p}_{1} = -\nabla_{\mathbf{z}} \Phi(\mathbf{z}_{1}). \tag{22}$$

- **Adjoint Vanishing:** Global optimality requires $H = 1 - t$ (solving until the terminal state). We characterize AR generation as a degenerate MPC policy whose horizon is truncated to a single discrete step ($H = 1$). This truncates the backward pass, effectively forcing $\mathbf{p}_{t} \equiv \mathbf{0}$ for $t < 1$.
- **Lyapunov Instability:** Without the restoring force provided by the adjoint feedback $\mathbf{u}_{\text{fb}} \propto \mathbf{p}_{t}$, the error dynamics $\mathbf{e}_{t} = \mathbf{z}_{t} - \mathbf{z}^{*}_{t}$ become unstable. Any perturbation $\epsilon$ (e.g., quantization noise) grows exponentially, leading to irreversible drift (hallucination):

  $$\frac{d}{dt}\|\mathbf{e}_{t}\| > 0 \quad (\text{Open-Loop Instability}). \tag{23}$$

  Geometrically, this resembles a ball rolling along a narrow ridge without a restoring force: any deviation pushes it irreversibly off the data manifold.
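As a rough illustration of the open-loop versus closed-loop contrast, the sketch below assumes scalar linear error dynamics $\dot{e} = (a - k)\,e$, where $a > 0$ is a hypothetical growth rate and $k$ is a hypothetical feedback gain. Neither quantity comes from the paper; the example only shows how a restoring term changes the sign of the error growth.

```python
def error_norm(a, k, e0=1e-3, t_end=1.0, n_steps=1000):
    """Euler-integrate de/dt = (a - k) * e and return |e(t_end)|."""
    dt = t_end / n_steps
    e = e0
    for _ in range(n_steps):
        e += dt * (a - k) * e
    return abs(e)


open_loop = error_norm(a=5.0, k=0.0)    # no feedback: the initial perturbation grows
closed_loop = error_norm(a=5.0, k=8.0)  # gain exceeds the growth rate: the perturbation decays
print(open_loop, closed_loop)
```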

##### 3. Geometric Blindness via Metric Singularity.

Efficient optimization requires a *Gradient Flow* on a Riemannian manifold $(\mathcal{M}, g)$.

- **Gradient Absence:** To clarify the absence of a smooth gradient on the discrete lattice $\mathcal{D}$, we invoke the definition of the Fréchet derivative on Riemannian manifolds. On a smooth manifold, the gradient $\nabla V(\mathbf{z}) \in T_{\mathbf{z}}\mathcal{M}$ is the unique vector satisfying the linearization condition $\lim_{\|\mathbf{d}\| \to 0} \|V(\mathbf{z}+\mathbf{d}) - V(\mathbf{z}) - \langle \nabla V, \mathbf{d} \rangle\| / \|\mathbf{d}\| = 0$. However, $\mathcal{D}$ possesses the discrete topology, where the metric is lower-bounded by $\|\mathbf{y} - \mathbf{z}\| \geq 1$ for $\mathbf{y} \neq \mathbf{z}$, precluding infinitesimal displacements. Thus, the smooth differential operator required by our HJB formulation is not directly available on $\mathcal{D}$. Furthermore, due to the high-frequency discontinuities of the semantic energy landscape $V$ across discrete tokens, no single vector $\mathbf{v}$ can satisfy the first-order Taylor approximation over the local neighborhood, which confirms the structural absence of gradient guidance:

  $$\nexists\ \mathbf{v} \in T_{\mathbf{z}}\mathcal{M} \quad \text{s.t.} \quad \langle \mathbf{v}, \mathbf{d} \rangle \approx V(\mathbf{z}+\mathbf{d}) - V(\mathbf{z}). \tag{24}$$
- **Combinatorial Fallback:** Without a smooth descent direction $-\nabla V$, the system relies on stochastic transition search. This degrades the convergence rate from linear/superlinear (Gradient Descent) to sublinear (Random Walk), manifesting as the efficiency-quality trade-off.

### A.3 Discussion

![Refer to caption](https://arxiv.org/html/2605.14531v1/x7.png)

Figure 7: Model structure and pipeline. Functioning as an optimal controller, our framework not only subsumes the capabilities of standard AR models but also unlocks some entirely novel tasks.

#### A.3.1 Universal Generation via Boundary Value Problems

Unlike autoregressive models, which are architecturally constrained to causal generation (Prefix $\to$ Suffix), our flow-based formulation treats text generation as a generic *Boundary Value Problem (BVP)*. In our control framework, different generation tasks (e.g., continuation, in-filling) correspond to solving the same ODE but with different boundary conditions. We define a masking operator $\mathcal{M}$ that partitions the latent state into a "Constraint" region $\mathbf{z}_{\text{ctx}}$ (fixed boundary condition) and a "State" region $\mathbf{z}_{\text{flow}}$ (free variable). The vector field is then conditioned on the constraints:

$$\frac{d\mathbf{z}_{\text{flow}}}{dt} = v_{\theta}(\mathbf{z}_{\text{flow}} \cup \mathbf{z}_{\text{ctx}}, t). \tag{25}$$

During training, we stochastically simulate diverse boundary conditions (Causal, In-filling, Prefix), as shown in [Figure 7](https://arxiv.org/html/2605.14531#A1.F7). This unifies all text generation tasks into a single mathematical formulation: *trajectory optimization under partial state constraints*.
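A minimal sketch of Eq. 25 as a constrained Euler solve follows. It assumes scalar latents for brevity, a boolean mask in place of the masking operator, and a hypothetical stand-in `v_theta` that simply pulls each position toward the sequence mean; the actual learned vector field and solver are not described at this level in the text.

```python
import random


def v_theta(z, t):
    """Hypothetical stand-in for the learned field: pull each position toward
    the mean of the full sequence, so free slots are influenced by the context."""
    mean = sum(z) / len(z)
    return [mean - zi for zi in z]


def solve_with_constraints(z_ctx, is_ctx, n_steps=20, seed=0):
    """Euler-integrate only the free positions; context positions stay clamped."""
    rng = random.Random(seed)
    # Free slots start from Gaussian noise, context slots from their given values.
    z = [c if m else rng.gauss(0.0, 1.0) for c, m in zip(z_ctx, is_ctx)]
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v = v_theta(z, i * dt)
        z = [zi if m else zi + dt * vi for zi, vi, m in zip(z, v, is_ctx)]
    return z


# In-filling style mask: prefix and suffix fixed, middle free.
z_ctx = [1.0, 1.0, 0.0, 0.0, 1.0]
is_ctx = [True, True, False, False, True]
z_out = solve_with_constraints(z_ctx, is_ctx)
print(z_out)
```

Switching tasks only changes the mask: `[True, True, False, False, False]` would correspond to prefix-conditioned continuation with the same solver.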

#### A.3.2 Topological Padding for Variable Lengths

To handle variable sequence lengths within this rigid ODE framework, we introduce a learnable `<PAD>` token embedding to represent the "Null State" (a region of zero potential and void semantics). We impose a "Topological Boundary" condition under which the flow must converge to the Null Attractor (Null State) in padding regions. By training the model to flow from noise to `<PAD>` in these areas, as shown in [Figure 7](https://arxiv.org/html/2605.14531#A1.F7), the controller learns the manifold's boundary. During inference, this allows the ODE solver to dynamically determine the sequence length by identifying when the flow state converges to the Null Attractor.
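One way to read the length-detection step is sketched below. It assumes padding sits at the end of the sequence and that "converged to the Null Attractor" can be approximated by a Euclidean distance threshold; both the threshold `tol` and the helper name `infer_length` are hypothetical and not taken from the paper.

```python
import math


def infer_length(latents, pad_embedding, tol=0.1):
    """Count positions remaining after stripping trailing latents that lie
    within `tol` of the <PAD> embedding (assumed convergence criterion)."""
    n = len(latents)
    while n > 0 and math.dist(latents[n - 1], pad_embedding) < tol:
        n -= 1
    return n


pad = (0.0, 0.0)
final_state = [(1.0, 0.5), (0.8, -0.3), (0.02, 0.01), (0.0, 0.0)]
print(infer_length(final_state, pad))  # 2
```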

### A.4 Latent Control Space and TextVAE Evaluation

TextVAE is the core module that builds the control-friendly latent control space, so its reconstruction fidelity and robustness are critical to text generation performance. Our TextVAE is trained on a diverse dataset to ensure broad generalization, covering Code (OpenCoder (Huang et al., [2025](https://arxiv.org/html/2605.14531#bib.bib21)), Infinity-Instruct (Li et al., [2025a](https://arxiv.org/html/2605.14531#bib.bib24))), Mathematics (OpenMath (Moshkov et al., [2025](https://arxiv.org/html/2605.14531#bib.bib31)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.14531#bib.bib9))), and General NLP (Wiki, Common Crawl, LM1B (Chelba et al., [2013](https://arxiv.org/html/2605.14531#bib.bib6))).

#### A.4.1 TextVAE Architecture and Objective

The TextVAE follows the design spirit of Deep Compression Autoencoders\(Chen et al\.,[2024](https://arxiv.org/html/2605.14531#bib.bib7)\), adapted from 2D images to 1D token sequences\. The encoder and decoder are symmetric, fully parallel convolutional networks\. We do not use self\-attention inside the VAE: each latent is computed from a local window of tokens, which preserves token\-to\-latent alignment and avoids global cross\-position entanglement before the diffusion controller\.

The VAE objective combines reconstruction, posterior regularization, and feature-stability terms:

$$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{\mathrm{rec}} + \beta(t)\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{stab}}\,\mathcal{L}_{\mathrm{stab}}. \tag{26}$$

For $\mathcal{L}_{\mathrm{rec}}$, the decoder first maps latent states to continuous token embeddings. Vocabulary logits are then computed by dot product with the shared embedding matrix, followed by learnable scale and bias terms. Binary cross-entropy is applied between these logits and one-hot token targets; empirically, this converged faster and gave better reconstruction than standard cross-entropy in our setting. The KL term regularizes the Gaussian posterior $q_{\phi}(\mathbf{z} \mid x)$ toward $\mathcal{N}(0, I)$ with a progressive schedule $\beta(t)$ to reduce posterior collapse. The stability loss regularizes intermediate features and improves numerical robustness under latent perturbations.
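The structure of Eq. 26 can be sketched for a single token position in plain Python. Several pieces here are assumptions: the linear warm-up in `beta_schedule` (the text only says the schedule is progressive), the diagonal Gaussian posterior parameterized by `mu` and `log_var`, and the stability term, which is passed in as a precomputed number because its exact form is not given.

```python
import math


def bce_with_logits(logits, target_index):
    """Binary cross-entropy between vocabulary logits and a one-hot target,
    summed over the vocabulary, in a numerically stable form."""
    total = 0.0
    for i, x in enumerate(logits):
        y = 1.0 if i == target_index else 0.0
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def beta_schedule(step, warmup_steps=1000, beta_max=1.0):
    """Assumed linear warm-up of the KL weight."""
    return beta_max * min(1.0, step / warmup_steps)


def vae_loss(logits, target_index, mu, log_var, step, l_stab=0.0, lambda_stab=0.0):
    """L_rec + beta(t) * L_KL + lambda_stab * L_stab for one token position."""
    return (bce_with_logits(logits, target_index)
            + beta_schedule(step) * kl_to_standard_normal(mu, log_var)
            + lambda_stab * l_stab)


loss = vae_loss([4.0, -3.0, -5.0], target_index=0, mu=[0.1, -0.2], log_var=[0.0, 0.0], step=500)
print(loss)
```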

Decoding is fully parallel\. Given a latent sequence, the convolutional decoder outputs continuous embeddings at all positions, projects them to vocabulary logits using the shared token embedding matrix with learnable scale/bias, applies softmax, and removes padding tokens when producing the final text\. No autoregressive dependency is used in the VAE decoder\.

#### A.4.2 Local Integral Operators and Downsampling

The convolutional blocks in TextVAE can be viewed as Local Integral Operators: the output at each latent position aggregates information from a finite local neighborhood through a translation\-equivariant kernel\. This locality is important for structured generation because token\-level composition, such as concatenating constrained and free segments, remains reflected in the latent representation\.

Downsampling is configurable. In the $4\times$ setting used in our main experiments, the encoder applies two downsampling stages and maps a length-$L$ token sequence to a length-$L/4$ latent sequence. Downsampling reduces sequence length while preserving local order and approximate local token-to-latent alignment, rather than collapsing the sentence into a globally mixed representation.

For boundary-conditioned generation, we first group tokens according to the VAE compression factor and pad within each group as needed. For example, a constraint pattern `C__CCC____`, where `C` denotes constrained tokens and underscores denote free slots, becomes `|C000|__00|CCC0|____|` under $4\times$ grouping, where `0` denotes `<PAD>`. The convolutional encoder maps each group to its corresponding latent position. Groups containing constrained tokens are encoded as context latents, while fully unconstrained groups are initialized from Gaussian noise. We use separate embeddings to mark context and free latents before passing them to the global controller, and the decoder removes `<PAD>` tokens after generation. This procedure supports both initial value problems (prefix-only constraints) and boundary value problems (prefix, suffix, or arbitrary constrained spans).
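The grouping example above can be reproduced with a short sketch. The padding rule used here (pad each contiguous run of constrained or free tokens to a multiple of the compression factor, then cut it into groups) is inferred from that single example and may not cover every case the authors handle; `group_pattern` and `latent_roles` are hypothetical helper names.

```python
from itertools import groupby

PAD = "0"  # stands for the <PAD> token in the pattern notation


def group_pattern(pattern, factor=4):
    """Pad each contiguous run of 'C' or '_' to a multiple of `factor`,
    then split it into groups of `factor` tokens."""
    groups = []
    for _, run in groupby(pattern):
        run = "".join(run)
        run += PAD * (-len(run) % factor)
        groups += [run[i:i + factor] for i in range(0, len(run), factor)]
    return groups


def latent_roles(groups):
    """Groups with any constrained token become context latents; the rest start from noise."""
    return ["context" if "C" in g else "noise" for g in groups]


groups = group_pattern("C__CCC____")
print("|" + "|".join(groups) + "|")  # |C000|__00|CCC0|____|
print(latent_roles(groups))          # ['context', 'noise', 'context', 'noise']
```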

Table 8: Reconstruction and robustness of the TextVAE across heterogeneous benchmarks. We report token-level reconstruction accuracy at $1\times$/$4\times$ latent compression rates. Robustness is measured by the estimated Lipschitz constants of the encoder and decoder, reflecting the smoothness of the learned latent manifold. Results demonstrate that our VAE achieves near-lossless compression while maintaining geometric stability.

We evaluate performance across 19 benchmarks focusing on two metrics: (1) *Reconstruction Accuracy* at $1\times$ and $4\times$ compression; and (2) *Robustness*, measured by the Lipschitz constant ($\sigma = 0.01$). As shown in Table [8](https://arxiv.org/html/2605.14531#A1.T8), TextVAE achieves *lossless reconstruction (1.00)* at $1\times$ and maintains $>99.3\%$ accuracy at $4\times$. Furthermore, low Lipschitz constants indicate a smooth latent space, which is critical for ensuring that perturbations during diffusion denoising yield semantically consistent decodings.
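The text does not spell out how the Lipschitz constant is estimated. One plausible reading, sketched below as an assumption, is to perturb inputs with Gaussian noise of standard deviation $\sigma = 0.01$ and record the largest output-to-input distance ratio. The function `estimate_lipschitz` and the linear `toy_decoder` are hypothetical; the toy map is chosen because its true constant (3, its largest scaling factor) is known.

```python
import math
import random


def estimate_lipschitz(f, points, sigma=0.01, n_trials=100, seed=0):
    """Largest observed ||f(z + eps) - f(z)|| / ||eps|| with eps ~ N(0, sigma^2 I).
    This is an empirical lower bound on the true Lipschitz constant."""
    rng = random.Random(seed)
    best = 0.0
    for z in points:
        fz = f(z)
        for _ in range(n_trials):
            eps = [rng.gauss(0.0, sigma) for _ in z]
            z_pert = [zi + ei for zi, ei in zip(z, eps)]
            ratio = math.dist(f(z_pert), fz) / math.hypot(*eps)
            best = max(best, ratio)
    return best


def toy_decoder(z):
    """Linear toy map with known Lipschitz constant 3."""
    return [3.0 * z[0], 0.5 * z[1]]


est = estimate_lipschitz(toy_decoder, points=[[0.0, 0.0], [1.0, -1.0]])
print(est)  # at most 3, and close to it when some perturbation aligns with the first axis
```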

### A.5 Additional Evaluation Details

#### A.5.1 Continuous Diffusion on Zero-Shot PPL

Continuous diffusion baselines are primarily trained and evaluated under conditional generation protocols. When evaluated on the unconditional zero-shot PPL setting used in [Table 2](https://arxiv.org/html/2605.14531#S5.T2), they underperform AR and discrete diffusion models because they are not optimized to model the unconditional data distribution. We therefore report them separately here and keep the main comparison aligned with the established discrete diffusion protocol.

Table 9: Zero-shot perplexity ($\downarrow$) of continuous diffusion baselines under the unconditional density-estimation setting.

#### A.5.2 Training and Inference Cost

Unless otherwise specified, inference measurements use a single NVIDIA RTX A6000 with batch size 64 and sequence length 128\.

Table 10: Practical training and inference cost of Manta-LM.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x8.png)

Figure 8: Analysis of the interplay between CFG guidance strength and integration fidelity.

### A.6 Interplay between Guidance Strength and Integration Fidelity

We investigate the joint impact of Classifier-Free Guidance (CFG) strength $w$ and sampling steps $N$ on generation quality in [Figure 8](https://arxiv.org/html/2605.14531#A1.F8). Our analysis reveals three distinct geometric regimes:

- **Under-Guided Regime ($w = 1.0$):** Performance is stable and largely invariant to $N$ but consistently suboptimal. The latent flow is smooth but lacks sufficient conditional force to drive the trajectory toward high-fidelity regions.
- **Optimal Regime ($w \in [3.0, 5.0]$):** This setting achieves the best quality-efficiency trade-off. Metrics saturate rapidly (within 20–30 steps), indicating that the vector field is sufficiently aligned with the condition while remaining smooth enough for coarse-step integration.
- **Over-Guided Regime ($w \geq 7.0$):** We observe a sharp performance collapse at low $N$. While quality recovers with finer discretization, this sensitivity highlights a *stiffness constraint*.

**Geometric Interpretation.** Unlike AR models where guidance acts as a logit bias, in Latent Optimal Control, CFG directly reshapes the velocity field: $v_{w} = v_{\text{uncond}} + w\,(v_{\text{cond}} - v_{\text{uncond}})$. Increasing $w$ amplifies the conditional gradient, effectively increasing the local Lipschitz constant (curvature) of the dynamics. Strong guidance renders the ODE *stiff*, necessitating fine-grained temporal discretization (high $N$) to avoid numerical divergence. Consequently, practical deployment requires balancing conditional strength with integration stability, favoring the moderate regime.
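The guided velocity field and its sensitivity to step count can be sketched with explicit Euler integration. The two toy fields below are assumptions chosen so the effect is easy to compute by hand: with them, the guided dynamics reduce to $\dot{z} = 4w - (1 + 3w)\,z$, so the decay rate (and hence stiffness) grows with $w$. They are not the model's learned fields, and the numbers do not reproduce Figure 8.

```python
def cfg_velocity(v_uncond, v_cond, w):
    """v_w = v_uncond + w * (v_cond - v_uncond), elementwise."""
    return [vu + w * (vc - vu) for vu, vc in zip(v_uncond, v_cond)]


def sample(z0, v_uncond_fn, v_cond_fn, w, n_steps):
    """Explicit Euler over t in [0, 1] using the guided velocity field."""
    z, dt = list(z0), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = cfg_velocity(v_uncond_fn(z, t), v_cond_fn(z, t), w)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def v_uncond(z, t):
    """Hypothetical unconditional field: pull toward 0."""
    return [-zi for zi in z]


def v_cond(z, t):
    """Hypothetical conditional field: pull toward 1 at a faster rate."""
    return [4.0 * (1.0 - zi) for zi in z]


w = 7.0
fixed_point = 4.0 * w / (1.0 + 3.0 * w)         # where the guided field vanishes
coarse = sample([0.0], v_uncond, v_cond, w, 10)  # dt * rate = 2.2 > 2: oscillates and diverges
fine = sample([0.0], v_uncond, v_cond, w, 40)    # dt * rate = 0.55 < 2: converges
print(abs(coarse[0] - fixed_point), abs(fine[0] - fixed_point))
```

In this toy setting, the same guidance strength that fails at 10 steps is well behaved at 40 steps, which mirrors the qualitative pattern described for the over-guided regime.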

![Refer to caption](https://arxiv.org/html/2605.14531v1/x9.png)

Figure 9: Step-by-step conditional generation process of Manta-LM on a paraphrase task. Given the input sentence “what was the best day of your life, and what happened?”, the figure visualizes the intermediate generation trajectories of Manta-LM across diffusion steps.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x10.png)

Figure 10: Step-by-step conditional generation process of Manta-LM on a paraphrase task. Given the input sentence “how can i be a good geologist?”, the figure visualizes the intermediate generation trajectories of Manta-LM across diffusion steps.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x11.png)

Figure 11: Visualizing error correction capabilities across different models. Red text indicates corrupted or erroneous tokens introduced by noise, while yellow text denotes tokens that are semantically consistent with the ground-truth text but differ in surface form.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x12.png)

Figure 12: Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model's generated results.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x13.png)

Figure 13: Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model's generated results.

![Refer to caption](https://arxiv.org/html/2605.14531v1/x14.png)

Figure 14: Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model's generated results.