Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
Summary
GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.
Cached at: 06/05/26, 06:09 PM
Source: https://huggingface.co/papers/2606.05833
Abstract
Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model’s internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
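To make the multi-objective idea concrete, here is a minimal sketch of how four geometric terms could be combined into one training loss. The abstract does not specify the loss forms or weights, so everything here is an assumption for illustration: the L1 losses for pose and depth, the log-space loss for the scale factor, the cosine distance for feature distillation, the equal weights, and all function and key names (`geometric_loss`, `pose`, `depth`, `scale`, `feats`) are hypothetical and are not taken from the GeoVR code. Plain Python lists stand in for tensors.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; a small epsilon guards against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-8)


def log_scale_loss(pred: float, target: float) -> float:
    """Absolute error in log space, so 2x and 0.5x errors cost the same."""
    return abs(math.log(pred) - math.log(target))


def geometric_loss(student: Dict, teacher: Dict,
                   weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum of four assumed terms, one per target in the abstract."""
    terms = {
        # (1) inter-frame camera pose, flattened to a vector (assumed layout)
        "pose": l1_loss(student["pose"], teacher["pose"]),
        # (2) dense depth map, flattened to a vector
        "depth": l1_loss(student["depth"], teacher["depth"]),
        # (3) metric scale factor, a single positive number
        "scale": log_scale_loss(student["scale"], teacher["scale"]),
        # (4) multi-scale features: average cosine distance over scales
        "feats": sum(cosine_distance(s, t)
                     for s, t in zip(student["feats"], teacher["feats"]))
                 / len(teacher["feats"]),
    }
    terms["total"] = sum(weights[name] * terms[name] for name in weights)
    return terms


# Toy values: the "teacher" plays the role of the 3D foundation model outputs.
teacher = {"pose": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0], "depth": [2.0, 4.0],
           "scale": 1.0, "feats": [[1.0, 0.0], [0.0, 1.0]]}
student = {"pose": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0], "depth": [1.0, 2.0],
           "scale": 2.0, "feats": [[1.0, 0.0], [1.0, 0.0]]}
weights = {"pose": 1.0, "depth": 1.0, "scale": 1.0, "feats": 1.0}

losses = geometric_loss(student, teacher, weights)
print(losses)
```

In a real training setup the per-term weights would likely need tuning, since the four terms live on different numeric scales; the equal weights above are only a placeholder.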
Models citing this paper: 1
#### WHB139426/GeoVR Updated about 3 hours ago • 11
Similar Articles
Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors
Stream3D-VLM is an online 3D vision-language model that enables real-time spatial understanding from streaming video by incrementally integrating geometry priors and using geometry-adaptive voxel compression, outperforming existing models on 3D spatial understanding tasks.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
This paper proposes GASP, a framework that injects geometric priors into vision-language models via deep supervision with contrastive and depth consistency losses, achieving significant improvements on 3D spatial reasoning benchmarks without using 3D VQA data.
Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models
This paper presents a systematic frozen-feature probing study comparing vision-language models (VLMs) and video generation models (VGMs) on spatial intelligence tasks. It finds that VLMs excel at semantic tagging and instance grouping, while VGMs provide better dense geometry and camera motion signals, and a naive fusion of both yields strong performance across all axes.
Geo-Align: Video Generation Alignment via Metric Geometry Reward
Geo-Align presents a reinforcement learning framework for camera-controlled video re-rendering that improves generalization through scale-aware perceptual rewards and metric 3D estimation for camera trajectory extraction.
GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
GeoStack introduces a geometric framework to compose independently trained domain experts in Vision-Language Models without catastrophic forgetting, achieving constant-time inference and a 10x reduction in geometric error.