Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views
Summary
DR-MV3D presents a map-grounded learning framework with dense rewards to improve multi-view 3D visual question answering through global map construction, view-trajectory planning, and egocentric grounding.
Cached at: 06/23/26, 05:40 AM
Source: https://huggingface.co/papers/2606.23557
Abstract
Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT+SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
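As a rough illustration of how dense process rewards could feed a GRPO-style update, the sketch below combines an answer reward with a map-consistency score and an order-aware trajectory score, then normalizes the totals within a group of sampled rollouts. This is not the paper's implementation: the function names, the longest-common-subsequence trajectory score, the reward weights, and the use of the population standard deviation are all assumptions made for illustration. The abstract does not specify how either reward is computed or weighted.

```python
from statistics import mean, pstdev


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def trajectory_reward(predicted_views, target_views):
    """Hypothetical order-aware score in [0, 1] for a predicted view sequence.

    Assumption: agreement with a pseudo-target ordering is measured by the
    longest common subsequence, so views in the wrong order earn less.
    """
    if not target_views:
        return 0.0
    longest = max(len(predicted_views), len(target_views))
    return lcs_length(predicted_views, target_views) / longest


def total_reward(answer_correct, map_score, traj_score, w_map=0.5, w_traj=0.5):
    """Sparse answer reward plus two dense terms (weights are hypothetical)."""
    return float(answer_correct) + w_map * map_score + w_traj * traj_score


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled rollouts for one question; map scores are made-up numbers.
target = [0, 2, 3]
rollouts = [
    (True, 0.8, [0, 2, 3]),   # correct answer, views in the target order
    (True, 0.6, [3, 2, 0]),   # correct answer, views in reversed order
    (False, 0.4, [1, 4]),     # wrong answer, unrelated views
]
rewards = [
    total_reward(ok, m, trajectory_reward(views, target))
    for ok, m, views in rollouts
]
advantages = group_relative_advantages(rewards)
```

In this toy group, the first two rollouts answer correctly, yet the one whose views follow the target order receives the larger advantage. That difference is the kind of process-level signal that answer-only supervision cannot provide.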
Similar Articles
Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
A training-free framework for spatial reasoning from egocentric videos that enables revisiting conclusions through synthesized novel-view videos generated from predicted 3D geometry.
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning
Introduces iVGR, a reinforcement learning framework that internalizes visual localization into textual reasoning for multimodal language models, eliminating the need for explicit visual grounding during inference while improving fine-grained perception performance.
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
This paper introduces SR-REAL, a unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning via reinforcement learning, enabling robust multi-step spatial reasoning across diverse tasks.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
This paper proposes GASP, a framework that injects geometric priors into vision-language models via deep supervision with contrastive and depth consistency losses, achieving significant improvements on 3D spatial reasoning benchmarks without using 3D VQA data.
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
This paper introduces ReasonMatch-Bench, a benchmark for wide-baseline matching in multimodal LLMs, and proposes Dynamic Correspondence Reinforcement Learning (DCRL) to improve spatial reasoning. Experiments show significant gains on the benchmark while maintaining general performance.