Generative-Model Predictive Planning for Navigation in Partially Observable Environments
Summary
This paper introduces BeliefDiffusion, a framework combining diffusion models to represent multimodal belief distributions and Model Predictive Control for planning in partially observable environments, achieving better navigation success and path efficiency than baselines.
View Cached Full Text
Cached at: 06/18/26, 05:41 AM
# Generative-Model Predictive Planning for Navigation in Partially Observable Environments
Source: [https://arxiv.org/html/2606.18888](https://arxiv.org/html/2606.18888)
\\authornote
These authors contributed equally to this work\.\\affiliation\\departmentDepartment of Computer Science\\institutionUniversity of Manchester\\cityManchester\\countryUnited Kingdom
\\authornotemark
\[1\]\\affiliation\\departmentDepartment of Computer Science\\institutionAalto University\\cityEspoo\\countryFinland
\\authornotemark
\[1\]\\affiliation\\departmentDepartment of Computer Science\\institutionUniversity of Manchester\\cityManchester\\countryUnited Kingdom
\\affiliation\\department
Department of Computer Science\\institutionUniversity of Manchester\\cityManchester\\countryUnited Kingdom
\\affiliation\\department
Department of Computer Science\\institutionAalto University\\cityEspoo\\countryFinland
###### Abstract\.
Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision\-making with limited sensory information in unknown environments\. Belief\-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high\-dimensional cases with perceptual aliasing\. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long\-term planning\. In this paper, we introduce*BeliefDiffusion*, a novel framework that combines the benefits of both generation and planning\.*BeliefDiffusion*leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control \(MPC\) to simultaneously plan ahead\. It consists of two steps: \(1\) Imagining plausible environment configurations based on observation history and \(2\) Planning efficient navigation strategies across an aggregated configurations\. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model\-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency\. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings\.
###### Key words and phrases:
Diffusion models and Partially observable environments and Model predictive control
## 1\.Introduction
We study point\-goal navigation in partially observable environments where an agent must reach a specified goal location without prior knowledge of the environment structure \(i\.e\., the map\)\. In our setting, the agent only observes a limited local region around itself along with its relative position with goal coordinates\. Navigation of this kind presents a fundamental challenge in many domains, e\.g\., roboticsross2008bayesianand autonomous drivingort2018autonomous\. In this setting, agents must make decisions based on limited or incomplete sensory information about their surroundings\. One traditional approach to solving this challenge is through Partially Observable Markov Decision Processes \(POMDPs\), which assumes that the agent cannot directly observe the true state but instead needs to maintain a probability “belief” of all possible states based on the observations received so far – also known as Belief MDP\. By constructing representations of environmental states that are consistent with observations, agents can make more informed decisions under uncertainty\. Since the belief state serves as a sufficient statistic of the agent’s history, there is no need to retain a growing record of past actions and observations, allowing decisions to rely on a fixed\-size representation\. Moreover, by maximizing the expected total discounted reward over the space of belief states, agents can achieve optimal decision\-making\. Techniques such as simultaneous localization and mapping \(SLAM\) have been used to build belief representations of environmentsdurrant2006simultaneous\. More recent approaches leverage neural network\-based world models to infer environmental states from observation historieszou2024diffbev;zhang2024imagine\.
However, these belief\-based methods often simplify belief representations into a unimodal point estimation, i\.e\., most likely statezhang2024imagine, sacrificing the probabilistic reasoning necessary for optimal navigation in partially observable settings\. This limitation is especially problematic in environments with perceptual aliasing and restricted sensor ranges, where multiple plausible configurations exist\. In such cases, unimodal approximations fail to capture the true multimodal nature of the belief space, leading to suboptimal decision\-makinggupta2024efficient\. Furthermore, these methods struggle to cope with high\-dimensional belief spaces and require explicit environmental models that are difficult to acquire in complex, unknown settingskarkus2017qmdp;hoerger2019pomdp\. Previous attempts to overcome these limitations include the use of generative models, which stitch navigation trajectories directly without explicitly planningha2018world;janner2022planning, or reinforcement learningwijmans2019dd, which focuses on model\-free and end\-to\-end training\. By learning a structured latent space, generative models enhance decision\-making efficiency without directly interacting with the real environment\. They effectively encode complex state distributionshafner2019dream, capturing a diverse range of possible states\. This makes them particularly well\-suited for large, high\-dimensional environments\. Despite their advantages, these models still require expert demonstrationsyu2024trajectory, or extensive offline trajectoriesjanner2022planning, and often ignore the need for long\-term and robust planning in navigationhong2023diffused\.
In this paper, we combine the benefits of both worlds, generation and planning, and introduce*BeliefDiffusion*, a novel framework that leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control \(MPC\) to simultaneously plan ahead\. BeliefDiffusion follows a two\-step process: \(1\) Imagining plausible environment configurations based on observation history and \(2\) Planning efficient navigation strategies across an aggregated configurations\. First, our model generalizes belief estimation to distribution generation, producing possible map layouts that are consistent with past observations\. Our key insight is to treat map layout inference as a conditional generative modeling problem, in order to capture the multimodal distribution of plausible environments\. By learning to generate diverse and realistic local maps, our model constructs an explicit belief representation over possible worlds\.
Figure 1\.BeliefDiffusion explicitly characterises multimodal belief distributions using diffusion models to generate plausible environment configurations from partial observations\. We then leverage Model Predictive Control, MPC, to plan robust navigation strategies across these beliefs, enabling efficient navigation in unknown environments\.Second, we integrate this belief representation into an MPC planning framework\. We iteratively minimize the cost based on a dynamic model and action proposals derived from the distribution of potential map layouts, and then select a locally optimal trajectory at each timestep\. By performing path planning jointly across sampled layouts, we identify robust and efficient navigation strategies in environments where accurate long\-horizon planning is impossible\.[Figure1](https://arxiv.org/html/2606.18888#S1.F1)illustrates the key steps of BeliefDiffusion, from initial observations to belief sampling and path planning\.
Through extensive experimentation in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms existing methodologies in both navigation success rate and path efficiency\. Our results show that by explicitly incorporating reasoned belief space into MPC, our framework enables more informed decision\-making in complex, unknown environments without requiring the extensive training data needed by reinforcement learning approaches\. The main contributions of our work include:
- •A novel diffusion\-based approach for explicitly modeling multimodal belief states in navigation,
- •A belief\-space model\-predictive planning framework that integrates generated map layouts for robust decision\-making in navigation,
- •Strong empirical evidence demonstrating improved sample efficiency and performance compared to both model\-free RL methods and approaches that rely on unimodal approximations\.
In essence, our method first*imagines*several plausible versions of the unseen parts of the world, then*plans*actions that work well across those possibilities\. Concretely, the diffusion model takes the agent’s partial experience, like a sketch of corridors glimpsed so far, and proposes multiple likely local maps rather than betting on a single guess\. MPC then evaluates short action sequences across this small bundle of maps and picks the move that keeps progress towards the goal while avoiding routes that would fail on any reasonable alternative\. This two\-step “imagine then plan” loop is repeated as new observations arrive, so poor hypotheses are discarded and good ones reinforced, much like a sat\-nav that continually recalculates as roads are revealed\. A simple analogy: if two passages might exist ahead, we prepare for both, taking the turn that remains safe and useful whichever passage proves real\. This makes the agent robust to perceptual aliasing, reduces wrong turns, and saves data, because we do not need to learn a monolithic policy for every situation, only to generate a handful of sensible maps and choose actions that hedge across them\. In short, modelling uncertainty explicitly, then planning*through*that uncertainty, is why BeliefDiffusion succeeds in challenging, partially observable navigation\.
## 2\.Related Work
#### POMDP for Partial Observability\.
Navigating in unseen environments presents significant challenges due to partial observability and unknown environmental dynamics\. Partially Observable Markov Decision Processes \(POMDPs\)kaelbling1998planningprovide a mathematical framework for decision\-making under partial observability by maintaining belief states over possible environmental configurations\. This enables agents to perform belief\-space planning, reasoning over latent state distributions to account for perceptual aliasing and ambiguous observations\. The POMDP framework has shown success in various robotic tasks involving partial observability, particularly in navigation contextsmartinez2009bayesian;lauri2016planning\. Despite their theoretical elegance, classical POMDP solvers face substantial scalability challenges\. These algorithms struggle with high\-dimensional belief stateshauskrecht2000valueand typically require explicit transition models, which necessitate extensive data collection in complex, unknown environmentskarkus2017qmdp;hoerger2019pomdp\. While reinforcement learning methods such as PPOschulman2017proximal;wijmans2019ddoffer alternative approaches, they frequently suffer from data inefficiency and training instability\. Recent neural approachesramakrishnan2022poni;zhang2024imagineattempt to model belief distributions in POMDPs but fail to address a critical challenge: capturing the multimodal nature of belief states in map environments\. Instead, these methods tend to collapse to the most probable latent state or rely on a single sample from their model, rather than maintaining the full belief distribution necessary for optimal decision\-making under partial observability\.
#### World Models for Environmental Understanding\.
Generative models play an increasingly important role in decision\-making problems, commonly referred to as World Models\. These models serve multiple functions: \(1\) learning abstract representations of environments\(ha2018world\), \(2\) predicting future stateshansen2022modem, and finally providing contextual information for decision\-making processes\. Recent advances in generative techniques, particularly diffusion models, have been applied to future state predictionjanner2022planningin complex scenarios including autonomous drivingzheng2025diffusion\. These approaches effectively offer probabilistic frameworks for modeling future trajectories\. However, most existing methods are constrained in scope, primarily focusing on predicting observation\-wise states rather than developing holistic environmental representations\. More succinct and task\-oriented representations, such as Occupancy Gridswang2024occsoraor Bird’s\-Eye View representationszou2024diffbev, often prove more effective, particularly for navigation tasks\. While these specialized models provide highly descriptive information, they typically lack direct integration with decision\-making processes—a crucial component for effective navigation\.
## 3\.Preliminaries
In this section, we provide a brief introduction to Partially Observable Markov Decision Processes \(POMDP\), Model Predictive Control, and diffusion models with classifier\-free guidance\.
### 3\.1\.Partially Observable Markov Decision Process \(POMDP\)
A Partially Observable Markov Decision Process \(POMDP\) is formally defined as a 6\-tuple⟨𝒮,𝒜,𝒪,𝒯,𝒵,ℛ⟩\\langle\\mathcal\{S\},\\mathcal\{A\},\\mathcal\{O\},\\mathcal\{T\},\\mathcal\{Z\},\\mathcal\{R\}\\rangle\. Here,𝒮\\mathcal\{S\}denotes the set of states, while𝒜\\mathcal\{A\}represents the set of available actions\. The agent receives observations from the set𝒪\\mathcal\{O\}, which provide partial information about the underlying state\. The system evolves according to the transition probability function𝒯\(s,a,s′\)=P\(st\+1=s′∣st=s,at=a\)\\mathcal\{T\}\(s,a,s^\{\\prime\}\)=P\(s\_\{t\+1\}=s^\{\\prime\}\\mid s\_\{t\}=s,a\_\{t\}=a\)\. The observation probability function𝒵\(s′,a,o\)=P\(ot\+1=o∣st\+1=s′,at=a\)\\mathcal\{Z\}\(s^\{\\prime\},a,o\)=P\(o\_\{t\+1\}=o\\mid s\_\{t\+1\}=s^\{\\prime\},a\_\{t\}=a\), specifies the likelihood of receiving observationoogiven that the system has transitioned to states′s^\{\\prime\}after taking actionaa\. Finally, the reward functionℛ\(s,a\)\\mathcal\{R\}\(s,a\)determines the immediate reward obtained upon executing actionaain statess\. A POMDP problem can be transformed into a fully observable Markov Decision Process \(MDP\) over belief states, often referred to as a belief MDP\. More formally, the state space consists of all possible belief states, denoted asb∈Δ\(𝒮\)b\\in\\Delta\(\\mathcal\{S\}\), whereΔ\(𝒮\)\\Delta\(\\mathcal\{S\}\)represents the set of all probability distributions over the state space𝒮\\mathcal\{S\}\. The action space remains the same as in the original POMDP, i\.e\.,𝒜\\mathcal\{A\}\. The transition function,τ\(b,a,b′\)\\tau\(b,a,b^\{\\prime\}\), gives the probability of transitioning to belief stateb′b^\{\\prime\}after taking actionaain belief statebb\. The reward function is defined asρ\(b,a\)=∑s∈𝒮b\(s\)ℛ\(s,a\)\\rho\(b,a\)=\\sum\_\{s\\in\\mathcal\{S\}\}b\(s\)\\mathcal\{R\}\(s,a\), whereb\(s\)b\(s\)is the belief probability of statessandℛ\(s,a\)\\mathcal\{R\}\(s,a\)is the reward function in the original POMDP\. When the agent takes actionaain belief statebband receives observationoo, the belief state is updated using Bayes’ rule:b′\(s′\)=P\(o∣s′,a\)∑s∈𝒮P\(s′∣s,a\)b\(s\)P\(o∣b,a\)b^\{\\prime\}\(s^\{\\prime\}\)=\\frac\{P\(o\\mid s^\{\\prime\},a\)\\sum\_\{s\\in\\mathcal\{S\}\}P\(s^\{\\prime\}\\mid s,a\)b\(s\)\}\{P\(o\\mid b,a\)\}whereP\(o∣b,a\)P\(o\\mid b,a\)is given by:P\(o∣b,a\)=∑s′∈𝒮P\(o∣s′,a\)∑s∈𝒮P\(s′∣s,a\)b\(s\)P\(o\\mid b,a\)=\\sum\_\{s^\{\\prime\}\\in\\mathcal\{S\}\}P\(o\\mid s^\{\\prime\},a\)\\sum\_\{s\\in\\mathcal\{S\}\}P\(s^\{\\prime\}\\mid s,a\)b\(s\)\. Since the agent cannot directly observe the true state of the environment, it maintains a probability distribution over all possible states, known as a belief state\. At timett, the belief statebtb\_\{t\}is a distribution over the state space𝒮\\mathcal\{S\}, wherebt\(s\)b\_\{t\}\(s\)denotes the probability that the system is in states∈𝒮s\\in\\mathcal\{S\}\.
In a navigation task, an agent must move through an environment with an uncertain layout\. If there is only a finite set of possible maps, denoted asℳ=\{m1,m2,…,mk\}\\mathcal\{M\}=\\\{m\_\{1\},m\_\{2\},\.\.\.,m\_\{k\}\\\}, where each map represents a distinct arrangement of obstacles, and when the agent performs an action and moves, its observation is its current position,pobsp\_\{obs\}\. Then the state space𝒫×ℳ\\mathcal\{P\}\\times\\mathcal\{M\}is defined by the combination of the agent’s positionppand the actual mapmm\. Since the agent’s positionp′p^\{\\prime\}is fully observed after each actionaa, the uncertainty about its location is eliminated, and the belief state simplifies to a probability distribution overℳ\\mathcal\{M\}\.
### 3\.2\.Model Predictive Control
Model Predictive Control \(MPC\) is a model\-based method for optimal controlmorari1999model;schwenzer2021review\. The core characteristic of MPC is to optimize the reward over a receding horizon, where the objective is:a^t:t\+H=argmaxat:t\+H𝔼pθ\(st\+1:t\+H\|at:t\+H,s0:t\)\[rt\+1\+⋯\+rt\+H\]\\hat\{a\}\_\{t:t\+H\}=\\arg\\max\_\{a\_\{t:t\+H\}\}\\mathbb\{E\}\_\{p\_\{\\theta\}\(s\_\{t\+1:t\+H\}\|a\_\{t:t\+H\},s\_\{0:t\}\)\}\\left\[r\_\{t\+1\}\+\\dots\+r\_\{t\+H\}\\right\]\. Here, a learned dynamic modelpθp\_\{\\theta\}predicts future state distributions given a sequence of actions\.ata\_\{t\}represents the action sequence over the horizonHH,st\+1,…s\_\{t\+1\},\\dotsare the future states andrt\+1,…r\_\{t\+1\},\\dotsare the corresponding rewards\. Once the optimal action sequencea^t:t\+H\\hat\{a\}\_\{t:t\+H\}is acquired, MPC use a closed\-loop control method, executing the first actionata\_\{t\}and replanning the remaining actions based on new observations\. MPC offers two key advantages: 1\. Adaptability to changing environments – Since MPC iteratively updates its decisions based on new observations, it naturally reacts to uncertainties\. 2\. Data efficiency – Unlike reinforcement learning, MPC often requires less data because it does not rely on learning a full policy from scratch\.
Two closely related works to our approach areDiffuserjanner2022planningandDiffusion Model Predictive Control \(D\-MPC\)zhou2024diffusion\.Diffuseremploys a diffusion model to jointly model future state and action distributions, integrating action proposal and dynamics prediction into a single framework\. In contrast,D\-MPCseparates dynamics modeling and action proposal into two diffusion models, allowing for better adaptation to environmental changes\. Our method retains the advantages of separating dynamics modeling and action proposal but does so in the belief space using only a single diffusion model\. This approach further enhances data efficiency while preserving adaptability\.
### 3\.3\.Denoising diffusion probabilistic models \(DDPM\)
Denoising diffusion probabilistic models \(DDPM\) is a generative model which have two processes: a forward diffusion process and a reverse denoising process\. Given a sample point𝒙0\\bm\{x\}\_\{0\}from an unknown distributionp\(𝒙0\)p\(\\bm\{x\}\_\{0\}\), the forward diffusion process adds a small amount of Gaussian noise to the sample𝒙0\\bm\{x\}\_\{0\}inTTsteps, producing a sequence of noisy samples𝒙1,𝒙2,…,𝒙T\\bm\{x\}\_\{1\},\\bm\{x\}\_\{2\},\.\.\.,\\bm\{x\}\_\{T\}\. The step sizes are controlled by a variance schedule\{βt∈\(0,1\)\}t=1T\\\{\\beta\_\{t\}\\in\(0,1\)\\\}\_\{t=1\}^\{T\}, yielding distributionsq\(𝒙t\|𝒙t−1\)=𝒩\(𝒙t;1−βt𝒙t−1,βt𝐈\)q\(\\bm\{x\}\_\{t\}\|\\bm\{x\}\_\{t\-1\}\)=\\mathcal\{N\}\(\\bm\{x\}\_\{t\};\\sqrt\{1\-\\beta\_\{t\}\}\\bm\{x\}\_\{t\-1\},\\beta\_\{t\}\\mathbf\{I\}\)andq\(𝒙1:T\|𝒙0\)=∏t=1Tq\(𝒙t\|𝒙t−1\)q\(\\bm\{x\}\_\{1:T\}\|\\bm\{x\}\_\{0\}\)=\\prod\_\{t=1\}^\{T\}q\(\\bm\{x\}\_\{t\}\|\\bm\{x\}\_\{t\-1\}\)\. After definingαt≜1−βt\\alpha\_\{t\}\\triangleq 1\-\\beta\_\{t\}andα¯t≜∏i=1tαi\\bar\{\\alpha\}\_\{t\}\\triangleq\\prod\_\{i=1\}^\{t\}\\alpha\_\{i\}, the diffusion kernel is given as𝒙t=α¯t𝒙0\+1−α¯tϵ\\bm\{x\}\_\{t\}=\\sqrt\{\\bar\{\\alpha\}\_\{t\}\}\\bm\{x\}\_\{0\}\+\\sqrt\{1\-\\bar\{\\alpha\}\_\{t\}\}\\bm\{\\epsilon\}, whereϵ\\bm\{\\epsilon\}is the Gaussian noise\. The reverse denoising process reverses the diffusion process to generate data𝒙0\\bm\{x\}\_\{0\}by denoising inTTsteps:p\(𝒙T\)=𝒩\(𝒙T;𝟎,𝐈\)p\(\\bm\{x\}\_\{T\}\)=\\mathcal\{N\}\(\\bm\{x\}\_\{T\};\\mathbf\{0\},\\mathbf\{I\}\)andpθ\(𝒙t−1\|𝒙t\)=𝒩\(𝒙t−1;μθ\(𝒙t,t\),σt2𝐈\)p\_\{\\theta\}\(\\bm\{x\}\_\{t\-1\}\|\\bm\{x\}\_\{t\}\)=\\mathcal\{N\}\(\\bm\{x\}\_\{t\-1\};\\mu\_\{\\theta\}\(\\bm\{x\}\_\{t\},t\),\\sigma^\{2\}\_\{t\}\\mathbf\{I\}\), whereμθ\\mu\_\{\\theta\}is a trainable network\. After a few arithmetic operations and reparameterisation, we can write down the variational objective as:minθ𝔼\[λt‖ϵ−ϵθ\(α¯t𝒙0\+1−α¯tϵ,t\)‖2\]\\min\_\{\\theta\}\\mathbb\{E\}\\left\[\\lambda\_\{t\}\\left\\lVert\\bm\{\\epsilon\}\-\\bm\{\\epsilon\}\_\{\\theta\}\\left\(\\sqrt\{\\bar\{\\alpha\}\_\{t\}\}\\bm\{x\}\_\{0\}\+\\sqrt\{1\-\\bar\{\\alpha\}\_\{t\}\}\\bm\{\\epsilon\},t\\right\)\\right\\rVert^\{2\}\\right\], where the expectation is over𝒙0∼q\(𝒙0\),t∼𝒰\{1,T\},ϵ∼𝒩\(𝟎,𝐈\)\\bm\{x\}\_\{0\}\\sim q\(\\bm\{x\}\_\{0\}\),t\\sim\\mathcal\{U\}\\\{1,T\\\},\\bm\{\\epsilon\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},\\mathbf\{I\}\), and𝒙t\\bm\{x\}\_\{t\}is substituted with𝒙0\\bm\{x\}\_\{0\}using𝒙t=α¯t𝒙0\+1−α¯tϵ\\bm\{x\}\_\{t\}=\\sqrt\{\\bar\{\\alpha\}\_\{t\}\}\\bm\{x\}\_\{0\}\+\\sqrt\{1\-\\bar\{\\alpha\}\_\{t\}\}\\bm\{\\epsilon\}\.ho2020denoisingobserve that simply settingλt\\lambda\_\{t\}to11for allttworks best in practice\.
The denoising process generates a data distribution from noise without specific conditions\. Classifier\-Free Guidance \(CFG\) is a way to modify an unconditional denoising process by conditioning on a discrete random variable𝒄\\bm\{c\}in order to control the generation process\(ho2022classifier\)\. In practice, CFG operates by mixing the predictions of a conditional diffusion modelpθ\(𝒙\|𝒄\)p\_\{\\theta\}\(\\bm\{x\}\|\\bm\{c\}\)and an unconditional diffusion modelpθ\(𝒙\)p\_\{\\theta\}\(\\bm\{x\}\), both of which can be parameterised via a shared network architecture\. For training the unconditional model, the condition is masked to a null token∅\\emptyset\. For sampling, the output at time stepttis given byϵ^\(𝒙t,t,𝒄\)=\(1\+w\)⋅ϵθ\(𝒙t,t,𝒄\)−w⋅ϵθ\(𝒙t,t,∅\)\\hat\{\\bm\{\\epsilon\}\}\(\\bm\{x\}\_\{t\},t,\\bm\{c\}\)=\(1\+w\)\\cdot\\bm\{\\epsilon\}\_\{\\theta\}\(\\bm\{x\}\_\{t\},t,\\bm\{c\}\)\-w\\cdot\\bm\{\\epsilon\}\_\{\\theta\}\(\\bm\{x\}\_\{t\},t,\\emptyset\)\.
## 4\.Method
### 4\.1\.Map Layout Inference with Diffusion Models
In our first module, we aim to train a diffusion model that can infer anS×SS\\times Slocal mapmmaround the agent, conditioned on the agent’s observation historyτ\\tau\.
Dataset CollectionTo train the diffusion model, we collect a dataset\(mn,τn\)\(m\_\{n\},\\tau\_\{n\}\):τn\\tau\_\{n\}is the agent’s observation history\{on0,on1,…,onT\}\\\{o\_\{n0\},o\_\{n1\},\\dots,o\_\{nT\}\\\}, andmnm\_\{n\}is the ground truth environment map\. We discretized each map into grid cells, and each map instance in our dataset is represented as a 2D binary occupancy grid, which isS×SS\\times Sin size and centered at the agent’s current position\. During the collection process, the agent navigates using a map that is intentionally distorted by noise\. This noisy map is the ground truth map with each cell flipped with probabilitypflipp\_\{flip\}\. We then use A\* algorithm to plan waypoints for the agent’s movement\. This method is designed to collect observation history in scenarios where the agent does not have access to an accurate map during testing\.
Figure 2\.Map embedding relies on multi\-head attention to selectively aggregate past observations to generate conditional embeddings for each grid cell in the local map\.Map Representation EmbeddingTo effectively adapting to varying history length, we leverage multi\-head attention to enable the model to focus selectively on relevant aspects of the observation history for each cell\. This yields extra benefits when the agent remains in the same location for long periods, since simple averaging or pooling of the observation history could dilute important information\. In specific, given a trajectory of observation historyτ=o1:T\\tau=o\_\{1:T\}, we aggregate observationsoto\_\{t\}based on the cell position\(it,jt\)\(i\_\{t\},j\_\{t\}\)where the agent receives those observations, as shown in[Figure2](https://arxiv.org/html/2606.18888#S4.F2)\. Specifically, for each grid cell\(i,j\)\(i,j\)in theS×SS\\times Slocal mapmm, we define the history sequence as the set of all past observations that occurred in that cell:Hij=\{ot∣it=i,jt=j\}H^\{ij\}=\\\{o\_\{t\}\\mid i\_\{t\}=i,j\_\{t\}=j\\\}\.If a cell has not been visited, its history remains empty, i\.e\.,Hij=∅H^\{ij\}=\\emptyset\. To effectively condition the U\-Net, we convert the sequence in each grid into an embedding through a multi\-head attention\. More specifically, for each grid cell\(i,j\)\(i,j\), we map its observation sequenceHijH^\{ij\}to key𝐤ti,j\\mathbf\{k\}^\{i,j\}\_\{t\}and value𝐯ti,j\\mathbf\{v\}^\{i,j\}\_\{t\}\. A shared learnable query𝐪\\mathbf\{q\}attends to these via multi\-head attention\. Unvisited cells \(oti,j=∅o^\{i,j\}\_\{t\}=\\emptyset\) are assigned zero embeddings\. The resulting embeddings\{𝐞ij\}i,j∈S×S\\\{\\mathbf\{e\}\_\{ij\}\\\}\_\{i,j\\in S\\times S\}form the conditional input patch for the diffusion model\.
Diffusion Model TrainingWe utilize Denoising Diffusion Probabilistic Models \(DDPM\)ho2020denoisingto train our map inference model because DDPM can effectively capture multi\-modality distribution, and are less susceptible to mode collapse than Generative Adversarial Networks \(GANs\)dhariwal2021diffusion\. We implement a U\-NetRonneberger2015unetas the backbone in the diffusion model\. Following classifier\-free guidanceho2022classifier, we randomly mask\{𝐞ij\}i,j∈S×S\\\{\\mathbf\{e\}\_\{ij\}\\\}\_\{i,j\\in S\\times S\}to zero with a probability of 0\.1 for the unconditional model\. Instead of predicting noise, We directly predict the original map𝒙0=m\\bm\{x\}\_\{0\}=m, similar to DALLE\(ramesh2022hierarchical\)\. Our model learns to reconstruct the original map by optimizing the following objective:
ℒ\(θ\)=𝔼t,ϵ,m,𝐞\[∥m−ϵθ\(mt,t,𝐞\)∥2\],\\mathcal\{L\}\(\\theta\)=\\mathbb\{E\}\_\{t,\\epsilon,m,\\mathbf\{e\}\}\\left\[\\lVert m\-\\epsilon\_\{\\theta\}\(m\_\{t\},t,\\mathbf\{e\}\)\\rVert^\{2\}\\right\],\(1\)wheret∼𝒰\{1,2,…,N\}t\\sim\\mathcal\{U\}\\\{1,2,\\ldots,N\\\}represents the diffusion timestep,ϵ∼𝒩\(𝟎,𝐈\)\\epsilon\\sim\\mathcal\{N\}\(\\mathbf\{0\},\\mathbf\{I\}\)is the noise target, andmtm\_\{t\}is the noisy version of the map, given bymt=α¯tm\+1−α¯tϵm\_\{t\}=\\sqrt\{\\bar\{\\alpha\}\_\{t\}\}m\+\\sqrt\{1\-\\bar\{\\alpha\}\_\{t\}\}\\bm\{\\epsilon\}\. We incorporate the conditional embedding and the 2D noise input by concatenating them and feeding the combined input into the diffusion model\. For sampling from the denoised imaginary layout, we first use the conditional model to obtain the model input and then follow the denoising process outlined inho2020denoising\. Finally, we apply a threshold of 0\.5 to the samples to generate a binary matrix\.
### 4\.2\.Navigation Planning
The trained model will generate plausible map layouts based on prior observations\. A ”plausible” map layout marks as empty for those grids traversed by the agent, and marks as blocks for those where the agent encountered obstacles, and extrapolates map structure in unobserved areas based on learned distributions, conditioned on empty and block grids\. To enable planning, we approximate the distribution of potential map layouts, denoted asP\(m\|τ\)P\(m\|\\tau\), by sampling a set𝒮N\\mathcal\{S\}\_\{N\}ofNNmap samples of sizeS×SS\\times Sfrom the diffusion model conditioned on the observed trajectoryτ\\tau\. These samples will be used for 1\) estimating the transition function in local regions\. and 2\) proposing candidate actions for planning\.
- •To estimate the dynamic model, we observe that the sample distribution reveals consensus on block placements near observed regions while exhibiting diversity in unexplored areas\. To model dynamics, we conservatively aggregate all generated local maps to form one mapmaggm\_\{\\text\{agg\}\}by treating a cell as a block if it is marked as a block in at leastN⋅λpN\\cdot\\lambda\_\{p\}of the generated samples \(λp\\lambda\_\{p\}is a threshold\), and as empty otherwise, even for areas outside of the generated local maps\. As shown in[Figure1](https://arxiv.org/html/2606.18888#S1.F1), the areas outside of the local maps are all marked as empty \(in grey\)\.
- •To produce action proposals, we generate a set of candidate paths, denoted aspip\_\{i\}, by applying a planning algorithm to each individual sampled map layout\. Note that this step is independent of the previous dynamic prediction\. For each sampled map, planning is conducted with the same start position \(i\.e\., agent’s current position\) and the goal position\.
All path candidates generated are then evaluated with the aggregated dynamics model,maggm\_\{\\text\{agg\}\}: any path violating the layout inmaggm\_\{\\text\{agg\}\}will receive zero reward\. Finally, we select the path with the highest reward as the final plan\.
In our planning module, the parameterSSdefines the size of the local map used for diffusion, effectively setting the spatial scope of our belief representation\. This is analogous to the planning horizon in MPC but applied within the belief space in a spatial context\. Importantly,SSaffects action proposals by constraining the spatial extent of the dynamic model during planning\. A largerSSprovides a broader spatial context, allowing planning to capture more global map structures and potentially make long\-term decisions, but comes with a price: largeSSyields high computational cost in inference and requires more diverse training data to learn effective representations for large map regions\. The number of samples,NN, representing the number of action proposals, also introduces a trade\-off\. IncreasingNNenables a more comprehensive exploration of possible map configurations, potentially leading to more efficient paths and reducing the risk of suboptimal layouts\. However, a largerNNalso increases computational costs and may result in infeasible solution proposal\.
We implement the A\* search algorithm for path planning by assigning a cost of 1 to moving through empty cells and penalizing inferred blocks with a cost greater than 1\. Manhattan distance serves as our heuristic function\. Although infinite costs for obstacles might seem ideal, we find that they can cause planning failures when inferred maps contain slight disconnections\. The selected path is then converted into waypoints for the agent to follow using a greedy strategy\. At each step, the agent chooses the action \(move forward, turn left, or turn right\) that best advances towards the next waypoint\. For complex movement, more sophisticated waypoint\-following or reactive control could improve performance\.
During testing, our system operates iteratively as follows: the agent collects observations, the layout inference module generates plausible map configurations, the navigation planning module plans the optimal path, and the agent executes actions to reach the next waypoints and gather new information\. Once the agent reaches the first waypoint, a new round of generation and planning starts\. This procedure repeats until the agent either reaches the goal or exceeds the predetermined step limit\.
## 5\.Experiments
### 5\.1\.Setup of the Experiment
Map DatasetWe constructed a synthetic map dataset inspired by point\-goal navigation in indoor environmentssavva2019habitat\. Each map instance in our dataset is represented as a 2D binary occupancy grid\. We chose to develop our own synthetic dataset rather than using existing navigation datasets from embodied AI simulations like Habitat for two reasons: 1\. Existing datasets are often integrated with too complex simulators that are difficult to customize for our specific requirements\. 2\. We can have precise control over dataset quality with synthetic tasks\. For example, the Gibson PointNav datasetxia2018gibsonused in the Habitat simulatorsavva2019habitatcontains navigation tasks with notable limitations: 23% of validation tasks can be solved by direct move towards the goal, most tasks have relatively short routes, and some scenes contain poorly reconstructed areas with holes in the floor that disrupt navigation\.
In contrast, our synthetic methodology grants us fine\-grained control over map properties, enabling the generation of a diverse dataset specifically designed to isolate the navigation task itself\. For the training dataset, we randomly generated 100 maps and collected 50,000 episodes, with each episode having a maximum length of 200\. The generated maps are 64×64 in size, and map\-generation details are provided in the supplementary material\.
Implementation DetailsOur diffusion model is based on a U\-Net architectureRonneberger2015unetthat consists of two downsampling and upsampling stages, with each resolution level containing two ResNet blocks\. An attention layer is applied at an 8×8 resolution, and we use group normalization and ReLU activation throughout the model, along with a 0\.1 dropout rate during training\. Trajectory information is integrated into both the input and skip connections through concatenation\. Our diffusion training and sampling employ a linear noise schedule and utilize the DDIM sampling algorithmsong2020denoisingfor efficient inference\. During evaluation, we perform sampling with 20 steps\. Hyperparameters are reported in the supplementary material\.
### 5\.2\.Evaluation Results
In this section, we address the following questions: 1\) Can diffusion models generate maps suitable for navigation? 2\) Does the explicit belief representation improve navigation performance compared to end\-to\-end approaches such as LSTM\-PPO? 3\) Can generative models that model at the observation level achieve comparable performance? 4\) How do the hyperparameters of the generative model impact planning performance?
#### Diffusion Models Generate Plausible and Diverse Layouts
To assess the ability of our diffusion model to generate navigable maps, we train the diffusion model on 16×16 maps and compare it against a deterministic prediction model trained using supervised learning with an MSE objective and the same U\-Net architecture\. Figure[3](https://arxiv.org/html/2606.18888#S5.F3)presents visualizations of map samples generated by both models\. The deterministic prediction model produces maps that do not exhibit any structure in the map dataset, such as paths, failing to capture the variety of plausible layouts\.
To further assess the quality of the generated maps, we conducted tests using agent random walks and recorded the blocks they traversed and interacted with\. Our results show that our model leaves97\.1%of the traversed blocks empty and correctly marks95\.9%of agent\-collision blocks as walls\. These findings highlight the model’s ability to accurately identify key map elements, demonstrating its effectiveness in generating meaningful and navigable layouts\.
Figure 3\.Generated maps from diffusion model and an deterministic prediction model\.
#### Navigation Performance Comparison with LSTM\-PPO
Next, we compare the navigation performance of our approach against LSTM\-PPOwijmans2019dd, a strong baseline in Navigation\.
We evaluate each method using 2000 navigation episodes across 100 different test maps which do not exist in training dataset\. A timeout of 200 steps was imposed for each episode\. We use two evaluation metrics:Success Rate \(SR\)andSuccess weighted by inverse Path Length \(SPL\)savva2019habitat\. Success Rate \(SR\) represents the percentage of episodes where the agent successfully reached the goal\. And SPL defined asSPL=Slmax\(p,l\)\\text\{SPL\}=S\\frac\{l\}\{\\max\(p,l\)\}, whereSSis a binary success indicator \(1 if successful, 0 otherwise\),ppis the path length taken by the agent, andllis the optimal \(shortest\) path length\. The agent had three possible actions at each step: move forward 0\.5 meters, turn left by 30 degrees, or turn right by 30 degrees\. The agent’s observation space consisted of its relative position to the goal and its current orientation, mimicking a scenario where navigation relies solely on GPS and compass sensors\. We also included a simpleGoal\-Followerbaseline, adapted from the Habitat Challengesavva2019habitat, where the agent directly navigates towards the goal in a straight line\. The results of this comparative analysis are summarized in Table[1](https://arxiv.org/html/2606.18888#S5.T1)\. Our method uses fixed parameters with a map size of 8x8, a sample number of N=10, and a CFG weight of 0\.5\.
For the LSTM\-PPO baseline, we use a two\-layer LSTM network with a hidden dimension of 512\. The hyperparameters for PPO and the network architecture remain consistent with those used inwijmans2019dd\. We also use the same reward design, which includes a terminal reward and a shaped reward\. The agent receives a terminal reward ofrT=2\.5r\_\{T\}=2\.5SPL if it is within 1m of the target\. The shaped reward at each step isrt\(at,st\)=−Δgeo dist−0\.01r\_\{t\}\(a\_\{t\},s\_\{t\}\)=\-\\Delta\_\{\\text\{geo dist\}\}\-0\.01, whereΔgeo dist\\Delta\_\{\\text\{geo dist\}\}is the geodesic distance change, and−0\.01\-0\.01is a step penalty to promote efficiency\. We use the fast marching methodsethian1999fastto approximate the geodesic distance\.
Table 1\.The Navigation performance comparison of our method against LSTM\-PPO\. The numbers in parentheses indicate the environment interaction steps for LSTM\-PPO and the offline dataset size for our method\.The Goal\-Follower baseline, which follows a straight\-line path, performs poorly, highlighting the difficulty of our navigation tasks\. In contrast, our method outperforms the best LSTM\-PPO agent, which achieves its best performance after 500M interactions with the environment\. We report LSTM\-PPO’s performance at 500M because, beyond this point, its performance experiences instability and diminishing returns\. On the other hand, our model reaches the same performance while being trained on a dataset that is only 1/500th the size of what the best LSTM\-PPO requires\.
Furthermore, we observe that LSTM\-PPO struggles with instability when the training task becomes too complex, limiting its ability to learn from episodes that demand long\-term planning\. Our diffusion\-based approach, which uses an explicit belief representation, offers more robust and sample\-efficient navigation\. Additional visualizations can be found in the appendix\.
#### Comparison with Deterministic and Generative Baselines
To show that the diffusion model for map layout modeling is necessary, we included the following baselines: 1\.Deterministic Prediction Model: In this approach, we replace diffusion with a deterministic prediction model\. 2\.Trajectory Diffusion: We train a 5\-layer transformer on the same dataset to predict the next 10 steps based on the observation history of positions relative to the goal position\. During testing, we use these predictions as waypoints for the agent to follow greedily, similar to our method\.
Table 2\.The Navigation performance comparison of our method against LSTM\-PPO\.The results demonstrate that our method significantly outperforms both deterministic and generative models on trajectory prediction in navigation tasks\. The deterministic prediction model exhibits more stable performance than the trajectory generative approach, highlighting that learning a map model is usually easier than predicting future dynamics, and that generative models at the observation level may require carefully designed, high\-quality datasets to be effective\. However, deterministic methods’ inability to capture multimodal distributions limits their effectiveness in complex environments\.
#### Impact of Hyperparameters on Planning Performance
Finally, we examine the influence of key hyperparameters on navigation performance, focusing on two key parameters:Inference Map Size \(S\)andBelief Space Sample Number \(N\)\. For each ablation, we report the best performance for CFG weights of 0\.0 and 0\.5, and sample sizes of N = 10 and 15 \(except for the ablation of the number of samples\)\.
Table 3\.Navigation Performance with Different Map Sizes and Dataset SizesThe results in Table[3](https://arxiv.org/html/2606.18888#S5.T3)highlight that a map size of 8 consistently delivers the best performance\. Additionally, navigation performance improves as the dataset size increases for all window sizes\. Increasing the map size beyond 8 either results in marginal performance reduction or no significant improvement\. This suggests that larger inference maps introduce added complexity without yielding better contextual understanding when the data is limited\. Furthermore, smaller map sizes, such as 4, cause the model to struggle with long\-term planning\. We observe that this often leads to paths with frequent back\-and\-forth movements due to limited context length\.
Table 4\.Navigation Performance with Different Window Sizes and Sample NumberFrom Table[4](https://arxiv.org/html/2606.18888#S5.T4), we observe significant performance improvements when increasing N from 1 to 5 and then from 5 to 10 across all window sizes\. However, the performance gains from further increasing N to 15 are much smaller, with some cases showing a slight decrease\. This suggests that with N=10 samples, the belief space is already adequately represented for effective navigation\. Increasing the sample number beyond this point offers only marginal improvements in belief accuracy, which may lead to infeasible path suggestions and a decrease in navigation performance, particularly with larger window contexts\.
## 6\.Conclusion
In this paper, we presented BeliefDiffusion, a novel framework for point\-goal navigation in partially observable environments that explicitly models multimodal belief states using diffusion models\. By generating a distribution of plausible map layouts conditioned on observation history and integrating these layouts into a Model Predictive Control \(MPC\) planner, BeliefDiffusion achieved robust and efficient navigation in complex and unknown environments\. Our empirical evaluation in synthetic 2D grid worlds demonstrated that BeliefDiffusion significantly outperforms both model\-free reinforcement learning methods like LSTM\-PPO and generative models relying on unimodal approximations or trajectory prediction, particularly in terms of sample efficiency and navigation performance\. While our results are promising, the computational demands of belief sampling for each planning step could be further improved in real\-time applications\. Future research directions include optimizing the computational efficiency of belief sampling and planning, and exploring the broader applicability of our BeliefDiffusion framework to other POMDP beyond point\-goal navigation\.
## ReferencesSimilar Articles
Learning A Unified Risk Map for Autonomous Driving in Partially Observable Environments
Proposes a unified risk map modeling framework for autonomous driving that integrates traffic flow and collision risks in partially observable environments, using spatiotemporal modeling and diffusion-based scenario generation. Outperforms state-of-the-art occlusion-aware baselines on the Waymo Open Motion Dataset.
From Noise to Control: Parameterized Diffusion Policies
This paper introduces Parameterized Diffusion Policy (PDP), a framework that makes diffusion policies controllable by conditioning on low-dimensional latent parameters, enabling smooth behavior interpolation and adaptation without retraining. It demonstrates improved performance on complex multimodal robot tasks in simulation and real-world experiments.
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
Proposes Model-Based Diffusion Policy Optimization (MBDPO), a framework that unifies search and policy optimization in world models using diffusion policy representations, achieving consistent scaling behavior and superior performance across offline and online reinforcement learning tasks.
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 is a new discrete diffusion planner for autonomous driving that uses reinforcement learning to enable self-editing of trajectory tokens, achieving high performance and low latency on the NAVSIM benchmark.
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
This paper introduces a novel adaptive scheduler for steering discrete diffusion language models using sparse autoencoders, demonstrating that targeting interventions based on when specific attributes commit improves control quality and strength over uniform methods.