Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching

Hugging Face Daily Papers 06/23/26, 12:00 AM Papers

stereo-matching zero-shot efficient computer-vision deep-learning knowledge-distillation cost-aggregation

Summary

Lite Any Stereo V2 presents an efficient stereo matching approach achieving state-of-the-art accuracy with significantly reduced latency through optimized architecture and training strategies, including a 2D-only cost aggregation framework and a three-stage training strategy.
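As a rough illustration of what 2D-only cost aggregation can mean in practice, the sketch below builds a correlation volume of shape (D, H, W), so that the disparity axis can be treated as channels by ordinary 2D convolutions instead of needing 3D convolutions over a 4D volume. This is a generic construction from the efficient-stereo literature, not the confirmed LAS2 architecture. The function names, tensor shapes, and the soft-argmax readout are all assumptions made for the example.

```python
import numpy as np

def correlation_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) correlation volume from (C, H, W) features.

    Entry [d, y, x] is the channel-mean of left[:, y, x] * right[:, y, x - d].
    Positions where x - d falls outside the image are left at zero.
    """
    _, h, w = feat_left.shape
    volume = np.zeros((max_disp, h, w), dtype=feat_left.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_left * feat_right).mean(axis=0)
        else:
            volume[d, :, d:] = (feat_left[:, :, d:] * feat_right[:, :, :-d]).mean(axis=0)
    return volume

def soft_argmax_disparity(volume, temperature=1.0):
    """Turn matching scores into a per-pixel expected disparity."""
    logits = volume / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)
    disps = np.arange(volume.shape[0], dtype=volume.dtype).reshape(-1, 1, 1)
    return (probs * disps).sum(axis=0)

# Toy input (hypothetical): random features with a true disparity of 2 pixels.
rng = np.random.default_rng(0)
feat_left = rng.standard_normal((256, 4, 16))
feat_right = np.roll(feat_left, -2, axis=2)  # right[x] = left[x + 2]

volume = correlation_volume(feat_left, feat_right, max_disp=8)
disparity = soft_argmax_disparity(volume, temperature=0.02)
```

In a real network, a stack of 2D convolutions would refine `volume` before the readout. The point of the layout is that every operation stays in 2D, which tends to map well onto deployment hardware.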

Recent advances in stereo matching have achieved remarkable accuracy, but often rely on large models, heavy computation, or additional foundation-model priors, making them difficult to deploy on resource-constrained platforms. In contrast, efficient stereo models offer faster inference but are commonly considered less capable of strong zero-shot generalization. In this paper, we challenge this assumption by introducing Lite Any Stereo V2 (LAS2), an ultra-fast model series designed for efficient zero-shot stereo matching. LAS2 is developed from both architecture and training perspectives. Architecturally, we revisit efficient stereo design under practical deployment settings and propose a 2D-only cost aggregation framework, optimized for real inference latency rather than theoretical MACs alone. For training, we develop a three-stage strategy that combines synthetic supervision, self-distillation, and real-world knowledge distillation. To improve the reliability of real-world pseudo supervision, we further introduce pseudo-label filtering and an error-clamping operation, enabling smoother synthetic-to-real transfer. We instantiate LAS2 as a family of models, including feed-forward variants for different efficiency budgets and an iterative variant for higher accuracy. Extensive experiments show that LAS2 achieves state-of-the-art accuracy among efficient stereo methods while maintaining significantly lower latency. Specifically, LAS2-H achieves stronger overall zero-shot performance than the iterative method Fast-FoundationStereo, with 1.8x and 2.7x faster inference on H200 and Orin, respectively. The project page, demos, and code are available at https://tomtomtommi.github.io/LiteAnyStereoV2/.
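The abstract names pseudo-label filtering and error clamping but does not define them here, so the following is only one plausible reading, sketched under stated assumptions: filtering is modeled as a confidence threshold on the teacher's pseudo labels, and clamping as a cap on the per-pixel absolute error. The function name, the `confidence` input, and both default thresholds are hypothetical and not taken from the paper.

```python
import numpy as np

def clamped_distillation_loss(student_disp, teacher_disp, confidence,
                              conf_threshold=0.5, max_error=5.0):
    """Mean absolute error against pseudo labels, filtered and clamped.

    Assumptions for this sketch (not from the paper):
      - `confidence` is a per-pixel score in [0, 1] for the teacher's label.
      - Pixels below `conf_threshold` are dropped (pseudo-label filtering).
      - Per-pixel error is capped at `max_error` (error clamping), so a few
        badly wrong pseudo labels cannot dominate the average.
    """
    mask = confidence >= conf_threshold
    if not mask.any():
        return 0.0  # no trustworthy pixels in this sample
    err = np.abs(student_disp - teacher_disp)
    err = np.minimum(err, max_error)
    return float(err[mask].mean())

# Toy 2x2 example with one low-confidence pixel and one outlier label.
student = np.array([[1.0, 2.0], [3.0, 100.0]])
teacher = np.array([[1.0, 4.0], [3.0, 0.0]])
confidence = np.array([[0.9, 0.9], [0.1, 0.9]])

loss = clamped_distillation_loss(student, teacher, confidence)
```

In the toy example the low-confidence pixel is ignored and the outlier's error of 100 is capped at 5. In a gradient-based version of this loss, pixels at the cap would also contribute zero gradient, which is one way clamping could soften the effect of unreliable real-world labels.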

Cached at: 06/25/26, 05:13 PM

Source: https://huggingface.co/papers/2606.24457

Get this paper in your agent:

hf papers read 2606.24457

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification

Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

LiteFrame Scales Video LLM Efficiency (6 minute read)

αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion
