Unlocking Dense Metric Depth Estimation in VLMs

Hugging Face Daily Papers 05/15/26, 12:00 AM Papers

Summary

DepthVLM enhances Vision-Language Models with a lightweight depth head and unified vision-text supervision, achieving dense metric depth estimation and improved 3D spatial reasoning while maintaining multimodal capabilities.

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
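To make the idea of a "lightweight depth head" concrete, here is a minimal sketch of one way per-patch hidden states could be turned into a dense depth map in a single pass. The abstract does not specify the head's architecture, so everything here is an assumption for illustration: the function name `depth_head`, the single linear projection per token, the patch "unpatchify" layout, and the softplus used to keep depths positive are all hypothetical and should not be read as DepthVLM's actual design.

```python
import math


def softplus(x):
    """Numerically stable softplus; maps any real value to a positive depth."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def depth_head(tokens, weight, bias, grid_h, grid_w, patch):
    """Hypothetical depth head (an assumption, not the paper's architecture).

    tokens: list of grid_h * grid_w feature vectors, each of length D,
            standing in for the backbone's hidden states at visual tokens.
    weight: list of patch * patch rows, each of length D.
    bias:   list of patch * patch floats.
    Returns a (grid_h * patch) x (grid_w * patch) depth map as nested lists.
    """
    assert len(tokens) == grid_h * grid_w
    height, width = grid_h * patch, grid_w * patch
    depth = [[0.0] * width for _ in range(height)]
    for idx, feat in enumerate(tokens):
        # Top-left pixel of the image patch this token is responsible for.
        row0 = (idx // grid_w) * patch
        col0 = (idx % grid_w) * patch
        for k in range(patch * patch):
            # One linear projection per output pixel inside the patch.
            z = bias[k] + sum(f * w for f, w in zip(feat, weight[k]))
            depth[row0 + k // patch][col0 + k % patch] = softplus(z)
    return depth
```

With a 2 x 3 token grid and `patch=2`, the call returns a 4 x 6 map, one depth value per pixel rather than one per token. A real head would likely add upsampling layers and skip connections; this sketch only shows how token-level features can be expanded to pixel-level output without issuing a separate query per pixel.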

Original Article

Cached at: 05/18/26, 06:24 AM

Source: https://huggingface.co/papers/2605.15876

Abstract

DepthVLM enhances Vision-Language Models with dense geometry prediction through a lightweight depth head and unified vision-text supervision, achieving superior 3D spatial reasoning while maintaining multimodal capabilities.

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
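The abstract mentions a metric depth benchmark but does not say which scores it reports. As a hedged illustration, the sketch below computes three metrics that are commonly used for metric depth estimation in general: absolute relative error (AbsRel), root mean squared error (RMSE), and the δ < 1.25 accuracy. The function name `depth_metrics` and the `min_depth`/`max_depth` validity range are assumptions for this example, not details taken from DepthVLM-Bench.

```python
import math


def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Common metric-depth scores over flat lists of depths in meters.

    Pixels whose ground truth falls outside (min_depth, max_depth) are
    treated as invalid and skipped. The range is an assumed default.
    """
    pairs = [
        (max(p, min_depth), g)  # clamp predictions so ratios stay finite
        for p, g in zip(pred, gt)
        if min_depth < g < max_depth
    ]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose prediction is within a factor of 1.25 of truth.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

Because these metrics compare predictions to ground truth in absolute units without any scale or shift alignment, they reward models that recover true metric depth rather than only relative ordering.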

Get this paper in your agent:

hf papers read 2605.15876

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### JonnyYu828/DepthVLM-4B Depth Estimation • 5B • Updated about 2 hours ago • 10 • 2

Datasets citing this paper: 1

#### JonnyYu828/DepthVLM-Bench Preview • Updated about 2 hours ago • 2

Similar Articles

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
