Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Hugging Face Daily Papers

Summary

DR-MV3D presents a map-grounded learning framework with dense rewards to improve multi-view 3D visual question answering through global map construction, view-trajectory planning, and egocentric grounding.

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
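
The two dense rewards and the group-relative optimization step described above can be illustrated with a small sketch. The abstract does not give the exact reward definitions, so everything below is an assumption made for illustration: the map is modeled as a dictionary from object name to 2D position, the global consistency reward is the fraction of pseudo-target objects placed within a distance tolerance, the local trajectory reward is a longest-common-subsequence score over ordered view indices, and the weights `w_map`, `w_traj`, and `w_ans` are hypothetical. Only the final function reflects a known property of GRPO, which normalizes each rollout's reward against the mean and standard deviation of its sampling group.

```python
from statistics import mean, pstdev


def map_consistency_reward(pred_map, target_map, tol=1.0):
    """Fraction of pseudo-target objects whose predicted position is within `tol`.

    Hypothetical stand-in for the global consistency reward; both maps are
    dicts of object name -> (x, y) in an allocentric frame.
    """
    if not target_map:
        return 0.0
    hits = 0
    for name, (tx, ty) in target_map.items():
        if name in pred_map:
            px, py = pred_map[name]
            if ((px - tx) ** 2 + (py - ty) ** 2) ** 0.5 <= tol:
                hits += 1
    return hits / len(target_map)


def trajectory_reward(pred_views, ref_views):
    """Order-aware score: LCS length of the two view sequences / reference length.

    Hypothetical stand-in for the local trajectory reward.
    """
    if not ref_views:
        return 0.0
    n, m = len(pred_views), len(ref_views)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if pred_views[i] == ref_views[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[n][m] / m


def total_reward(rollout, target_map, ref_views, gold_answer,
                 w_map=0.3, w_traj=0.3, w_ans=0.4):
    """Weighted sum of the two process rewards and an answer reward (assumed weights)."""
    r_map = map_consistency_reward(rollout["map"], target_map)
    r_traj = trajectory_reward(rollout["views"], ref_views)
    r_ans = 1.0 if rollout["answer"] == gold_answer else 0.0
    return w_map * r_map + w_traj * r_traj + w_ans * r_ans


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, several rollouts sampled for the same question would each be scored with `total_reward`, and the resulting list passed to `group_relative_advantages`. A rollout that builds a better map or picks views in a better order then receives a higher advantage even when every rollout in the group reaches the same final answer, which is the kind of signal that sparse answer-level supervision cannot provide.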

Cached at: 06/23/26, 05:40 AM

Source: https://huggingface.co/papers/2606.23557

Similar Articles

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Hugging Face Daily Papers

This paper introduces SR-REAL, a unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning via reinforcement learning, enabling robust multi-step spatial reasoning across diverse tasks.

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Hugging Face Daily Papers

This paper introduces ReasonMatch-Bench, a benchmark for wide-baseline matching in multimodal LLMs, and proposes Dynamic Correspondence Reinforcement Learning (DCRL) to improve spatial reasoning. Experiments show significant gains on the benchmark while maintaining general performance.