Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Hugging Face Daily Papers 06/04/26, 12:00 AM Papers

geometric-representations video-understanding multimodal-large-language-models 3d-awareness spatial-intelligence knowledge-distillation

Summary

GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
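The abstract names four geometric targets but does not say how they are combined. One common way to write such a multi-objective strategy is a weighted sum of per-target losses. The sketch below is an assumption for illustration only: the loss symbols and the weights $\lambda$ are hypothetical and are not taken from the paper.

```latex
\mathcal{L}_{\text{geo}}
  = \lambda_{\text{pose}}\,\mathcal{L}_{\text{pose}}
  + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}}
  + \lambda_{\text{scale}}\,\mathcal{L}_{\text{scale}}
  + \lambda_{\text{feat}}\,\mathcal{L}_{\text{feat}}
```

Here $\mathcal{L}_{\text{pose}}$ would penalize errors in inter-frame camera poses, $\mathcal{L}_{\text{depth}}$ errors in dense depth maps, $\mathcal{L}_{\text{scale}}$ errors in the metric scale factor, and $\mathcal{L}_{\text{feat}}$ the mismatch between the MLLM's intermediate features and the multi-scale features of the 3D foundation model.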

Original Article

Cached at: 06/05/26, 06:09 PM

Source: https://huggingface.co/papers/2606.05833

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model’s internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
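To make targets (1) to (3) more concrete, here is a minimal sketch of how an inter-frame camera pose and a metric depth map could be computed from per-frame quantities. It is not GeoVR's implementation: the function names, the camera-to-world convention, and the idea that metric depth is a relative depth map multiplied by a single scale factor are all assumptions made for illustration.

```python
# Illustrative sketch only. All names and conventions are assumptions,
# not taken from the GeoVR paper or its code.

def invert_rigid(T):
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    Rt = [[R[c][r] for c in range(3)] for r in range(3)]  # transpose of R
    t_inv = [-sum(Rt[r][k] * t[k] for k in range(3)) for r in range(3)]
    return [Rt[r] + [t_inv[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def relative_pose(T_i, T_j):
    """Pose of camera j in camera i's frame, given camera-to-world inputs."""
    return matmul4(invert_rigid(T_i), T_j)

def to_metric_depth(relative_depth, scale):
    """Rescale an up-to-scale depth map with one global scale factor."""
    return [[scale * d for d in row] for row in relative_depth]
```

For example, two cameras with identical orientation whose positions differ by 2 units along x yield a relative pose with identity rotation and translation (2, 0, 0). Under these assumptions the scale factor is what turns that unitless "2" and the up-to-scale depth values into real-world distances.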

Get this paper in your agent:

hf papers read 2606.05833

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### WHB139426/GeoVR Updated about 3 hours ago • 11

Similar Articles

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Geo-Align: Video Generation Alignment via Metric Geometry Reward

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
