Unlocking Dense Metric Depth Estimation in VLMs

Hugging Face Daily Papers

Summary

DepthVLM enhances Vision-Language Models with a lightweight depth head and unified vision-text supervision, achieving dense metric depth estimation and improved 3D spatial reasoning while maintaining multimodal capabilities.

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
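To make the idea of a "lightweight depth head" concrete, here is a minimal NumPy sketch of one way per-token hidden states could be turned into a full-resolution depth map in a single pass. Everything in it is an assumption for illustration, not the paper's design: the function name `depth_head`, the single linear projection, the softplus activation, the patch size of 14, and the pixel-shuffle-style rearrangement are all hypothetical.

```python
import numpy as np

def depth_head(hidden, grid_hw, patch, weight, bias):
    """Hypothetical lightweight depth head (an assumption, not the paper's design).

    hidden  : (N, D) hidden states at the N = h * w visual-token positions
    grid_hw : (h, w) layout of those tokens over the image
    patch   : side length, in pixels, of the image patch behind each token
    weight  : (D, patch * patch) linear projection
    bias    : (patch * patch,) projection bias
    Returns an (h * patch, w * patch) array of strictly positive depths.
    """
    h, w = grid_hw
    assert hidden.shape[0] == h * w, "one hidden state per visual token"
    logits = hidden @ weight + bias              # (N, patch * patch)
    depth = np.logaddexp(0.0, logits)            # softplus keeps depths positive
    depth = depth.reshape(h, w, patch, patch)    # token grid x pixels inside each patch
    depth = depth.transpose(0, 2, 1, 3)          # reorder to (h, patch, w, patch)
    return depth.reshape(h * patch, w * patch)   # stitch patches into one map

# Toy inputs standing in for real backbone activations (hypothetical sizes).
rng = np.random.default_rng(0)
h, w, p, D = 3, 4, 14, 32
hidden = rng.standard_normal((h * w, D))
weight = rng.standard_normal((D, p * p)) * 0.02
bias = np.zeros(p * p)

depth_map = depth_head(hidden, (h, w), p, weight, bias)
print(depth_map.shape)  # (42, 56)
```

Because every token emits a whole patch of depth values, the output resolution matches the input image without issuing one query per pixel, which is the efficiency argument the abstract makes against per-pixel querying.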

Cached at: 05/18/26, 06:24 AM

Paper page - Unlocking Dense Metric Depth Estimation in VLMs

Source: https://huggingface.co/papers/2605.15876

Abstract

DepthVLM enhances Vision-Language Models with dense geometry prediction through a lightweight depth head and unified vision-text supervision, achieving superior 3D spatial reasoning while maintaining multimodal capabilities.

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
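For readers unfamiliar with how metric depth predictions are usually scored, the sketch below computes two metrics that are widely reported in the depth estimation literature: absolute relative error (AbsRel) and the δ < 1.25 threshold accuracy. The abstract does not say which metrics the DepthVLM benchmark uses, so treating these two as representative is an assumption, and the function name `depth_metrics` is hypothetical.

```python
def depth_metrics(pred, gt):
    """Return (AbsRel, delta_1) over pixels whose ground-truth depth is valid.

    pred, gt : flat sequences of depths in metres; gt <= 0 marks a missing pixel.
    AbsRel   : mean of |pred - gt| / gt
    delta_1  : fraction of pixels with max(pred / gt, gt / pred) < 1.25
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with valid ground truth")
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    # A non-positive prediction can never satisfy the ratio test.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < 1.25)
    return abs_rel, hits / len(pairs)

# Four pixels; the last has no ground truth and is ignored.
pred = [2.0, 4.5, 9.0, 3.0]
gt = [2.0, 5.0, 6.0, 0.0]
abs_rel, delta_1 = depth_metrics(pred, gt)
print(f"AbsRel={abs_rel:.3f}  delta_1={delta_1:.3f}")
```

Both metrics are computed on absolute depths with no scale alignment, which is what distinguishes metric depth evaluation from relative (scale-invariant) evaluation.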


Get this paper in your agent:

hf papers read 2605.15876

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### JonnyYu828/DepthVLM-4B Depth Estimation • 5B • Updated about 2 hours ago • 10 • 2

Datasets citing this paper: 1

#### JonnyYu828/DepthVLM-Bench Preview • Updated about 2 hours ago • 2

Similar Articles

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Hugging Face Daily Papers

This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.

Large Vision-Language Models Get Lost in Attention

arXiv cs.AI

This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.