Unlocking Dense Metric Depth Estimation in VLMs
Summary
DepthVLM enhances Vision-Language Models with a lightweight depth head and unified vision-text supervision, achieving dense metric depth estimation and improved 3D spatial reasoning while maintaining multimodal capabilities.
Cached at: 05/18/26, 06:24 AM
Source: https://huggingface.co/papers/2605.15876
Abstract
DepthVLM enhances Vision-Language Models with dense geometry prediction through a lightweight depth head and unified vision-text supervision, achieving superior 3D spatial reasoning while maintaining multimodal capabilities.
Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
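For readers unfamiliar with how metric depth predictions are usually scored, the following is a minimal sketch of two metrics that are common in the depth estimation literature: absolute relative error (AbsRel) and the δ1 accuracy (the fraction of pixels whose ratio to ground truth is below 1.25). The abstract does not say which metrics the DepthVLM benchmark reports, so the choice of metrics, the function name `depth_metrics`, and the `min_depth` validity cutoff are all assumptions made for illustration.

```python
def depth_metrics(pred, gt, min_depth=1e-3, threshold=1.25):
    """Return (abs_rel, delta1) over pixels with valid ground truth.

    pred, gt: flat sequences of depths in metres (same length).
    min_depth: hypothetical cutoff; ground-truth values at or below it
               are treated as missing and skipped.
    """
    # Keep only pixels with valid ground truth; clamp predictions so the
    # ratio below is always defined.
    pairs = [(max(p, min_depth), g) for p, g in zip(pred, gt) if g > min_depth]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")

    n = len(pairs)
    # AbsRel: mean of |pred - gt| / gt.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # delta1: share of pixels where max(pred/gt, gt/pred) < threshold.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / n
    return abs_rel, delta1


# Toy example: the last pixel has no ground truth (0.0) and is ignored.
pred = [2.0, 4.4, 10.0, 1.0]
gt = [2.0, 4.0, 5.0, 0.0]
abs_rel, delta1 = depth_metrics(pred, gt)
print(f"AbsRel={abs_rel:.3f}  delta1={delta1:.3f}")
```

Because both metrics compare predictions to ground truth in absolute units, they penalize a model that recovers the correct relative layout at the wrong scale, which is what distinguishes metric depth from relative depth evaluation.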
Get this paper in your agent:
hf papers read 2605.15876
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
#### JonnyYu828/DepthVLM-4B Depth Estimation • 5B • Updated about 2 hours ago • 10 • 2
Datasets citing this paper: 1
#### JonnyYu828/DepthVLM-Bench Preview • Updated about 2 hours ago • 2
Similar Articles
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.
Large Vision-Language Models Get Lost in Attention
This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
GeoStack introduces a geometric framework to compose independently trained domain experts in Vision-Language Models without catastrophic forgetting, achieving constant-time inference and a 10x reduction in geometric error.
@cjzafir: VLMs (Vertical Language Models) are beating top LLMs. These small 7B to 15B niche-focused models are beating SoTA model…
The author demonstrates that small vertical language models (6B-15B) can outperform top LLMs on niche benchmarks through cost-effective fine-tuning using open-source models and Codex orchestration, achieving results with a $300 dataset.
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
MemLens is a new benchmark for evaluating memory capabilities in large vision-language models through multi-session conversations. It compares long-context and memory-augmented approaches, revealing limitations in both and motivating hybrid architectures.