Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
Summary
A training-free framework for spatial reasoning from egocentric videos that enables revisiting conclusions through synthesized novel-view videos generated from predicted 3D geometry.
Cached at: 06/11/26, 01:38 PM
Source: https://huggingface.co/papers/2606.11683
Abstract
Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM’s native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/
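The two-phase control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `rere`, `ask_mllm`, `render_novel_view`, and `Answer` are hypothetical, the prompt wording is invented for illustration, and the sketch assumes the Re-reason Phase receives only the synthesized novel-view video together with the initial hypothesis (the abstract does not specify the exact inputs).

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Frame = bytes            # placeholder frame type (assumption for this sketch)
Video = Sequence[Frame]  # a video is treated as an ordered sequence of frames


@dataclass
class Answer:
    text: str      # final answer after the Re-reason Phase
    revised: bool  # True if the final answer differs from the initial hypothesis


def rere(
    question: str,
    video: Video,
    ask_mllm: Callable[[str, Video], str],         # hypothetical MLLM wrapper
    render_novel_view: Callable[[Video], Video],   # hypothetical Geometry-to-Video stand-in
) -> Answer:
    # Reason Phase: form a spatial hypothesis from the original egocentric video.
    hypothesis = ask_mllm(question, video)

    # Geometry-to-Video: synthesize a complementary novel-view video.
    # In the paper this is rendered from predicted 3D geometry; here it is an opaque callable.
    novel_view = render_novel_view(video)

    # Re-reason Phase: verify or revise the hypothesis using the novel view.
    prompt = (
        f"{question}\n"
        f"Initial hypothesis: {hypothesis}\n"
        "Verify or revise this hypothesis using the new viewpoint."
    )
    final = ask_mllm(prompt, novel_view)
    return Answer(text=final, revised=(final != hypothesis))


# Usage with stub components (purely illustrative, no real model or renderer).
def stub_mllm(prompt: str, video: Video) -> str:
    # Answers "left" at first, then "right" once shown the hypothesis prompt.
    return "right" if "Initial hypothesis" in prompt else "left"


def stub_render(video: Video) -> Video:
    return list(reversed(video))


result = rere("Which side is the chair on?", [b"f0", b"f1"], stub_mllm, stub_render)
```

Because both phases go through the same `ask_mllm(prompt, video)` call, the sketch mirrors the abstract's point that the novel view is delivered through the MLLM's ordinary video interface rather than through any architectural change.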
Similar Articles
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
This paper introduces SR-REAL, a unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning via reinforcement learning, enabling robust multi-step spatial reasoning across diverse tasks.
Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
The paper proposes Astra, an agentic spatial reasoning framework that couples a reinforcement learning-trained VLM policy with a world simulator to generate novel-view observations for improved spatial reasoning in Vision-Language Models.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning
This paper proposes a self-supervised reinforcement learning framework that uses consistency verifiers—reward functions checking geometric and semantic consistency under transformations—to improve spatial reasoning in large reasoning models without requiring ground-truth annotations. The method approaches the accuracy of supervised fine-tuning and generalizes across diverse tasks.
SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning
The paper proposes SVoT, a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations for multi-hop spatial reasoning in MLLMs, achieving significant accuracy gains on new benchmarks involving multi-object interactions and numerical reasoning.