Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Hugging Face Daily Papers 06/22/26, 12:00 AM Papers

Summary

DR-MV3D is a map-grounded learning framework that uses dense rewards to improve multi-view 3D visual question answering by supervising three intermediate stages: global map construction, view-trajectory planning, and egocentric grounding.

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
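To make the reward structure described above concrete, here is a minimal sketch of how a map reward, an ordered-view reward, and an answer reward could be combined and then turned into group-relative advantages in the style of GRPO. The abstract gives no formulas, so everything here is an illustrative assumption rather than the authors' method: the function names, the 2D object-position map, the distance tolerance, the longest-common-subsequence score for ordered view selection, and the reward weights are all hypothetical. Only the group normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from math import hypot
from statistics import mean, pstdev


def global_consistency_reward(pred_map, pseudo_map, tol=1.0):
    """Fraction of pseudo-target objects the predicted map places within `tol`.

    Assumption: maps are dicts of object name -> (x, y) in a shared top-down frame.
    """
    if not pseudo_map:
        return 0.0
    hits = 0
    for name, (tx, ty) in pseudo_map.items():
        if name in pred_map:
            px, py = pred_map[name]
            if hypot(px - tx, py - ty) <= tol:
                hits += 1
    return hits / len(pseudo_map)


def local_trajectory_reward(pred_views, ref_views):
    """Longest common subsequence with the reference view order, normalized to [0, 1].

    Assumption: a reference ordering of view indices is available; LCS is one
    possible way to score *ordered* selection, chosen here for illustration.
    """
    if not ref_views:
        return 0.0
    m, n = len(pred_views), len(ref_views)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if pred_views[i] == ref_views[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return lcs[m][n] / n


def dense_reward(rollout, target, w_map=0.3, w_traj=0.3, w_ans=0.4):
    """Weighted sum of process rewards and the answer reward (weights are hypothetical)."""
    r_map = global_consistency_reward(rollout["map"], target["map"])
    r_traj = local_trajectory_reward(rollout["views"], target["views"])
    r_ans = 1.0 if rollout["answer"] == target["answer"] else 0.0
    return w_map * r_map + w_traj * r_traj + w_ans * r_ans


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three rollouts sampled for the same question.
target = {"map": {"sofa": (0, 0), "lamp": (2, 1)}, "views": [1, 3], "answer": "left"}
rollouts = [
    {"map": {"sofa": (0, 0), "lamp": (2, 1)}, "views": [1, 3], "answer": "left"},
    {"map": {"sofa": (0, 0), "lamp": (5, 5)}, "views": [3, 1], "answer": "left"},
    {"map": {}, "views": [2], "answer": "right"},
]
rewards = [dense_reward(r, target) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy group the second rollout answers correctly but misplaces one object and visits the views out of order, so it receives a lower reward than the first. That difference is the signal an answer-only reward would miss, since both rollouts would score identically under sparse supervision.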


Cached at: 06/23/26, 05:40 AM

Source: https://huggingface.co/papers/2606.23557

Similar Articles

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
