Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching
Summary
Lite Any Stereo V2 presents an efficient stereo matching approach achieving state-of-the-art accuracy with significantly reduced latency through optimized architecture and training strategies, including a 2D-only cost aggregation framework and a three-stage training strategy.
Cached at: 06/25/26, 05:13 PM
Source: https://huggingface.co/papers/2606.24457
Abstract
Lite Any Stereo V2 (LAS2) presents an efficient stereo matching approach that achieves state-of-the-art accuracy with significantly reduced latency through optimized architecture and training strategies.
Recent advances in stereo matching have achieved remarkable accuracy, but often rely on large models, heavy computation, or additional foundation-model priors, making them difficult to deploy on resource-constrained platforms. In contrast, efficient stereo models offer faster inference but are commonly considered less capable of strong zero-shot generalization. In this paper, we challenge this assumption by introducing Lite Any Stereo V2 (LAS2), an ultra-fast model series designed for efficient zero-shot stereo matching. LAS2 is developed from both architecture and training perspectives. Architecturally, we revisit efficient stereo design under practical deployment settings and propose a 2D-only cost aggregation framework, optimized for real inference latency rather than theoretical MACs alone. For training, we develop a three-stage strategy that combines synthetic supervision, self-distillation, and real-world knowledge distillation. To improve the reliability of real-world pseudo supervision, we further introduce pseudo-label filtering and an error-clamping operation, enabling smoother synthetic-to-real transfer. We instantiate LAS2 as a family of models, including feed-forward variants for different efficiency budgets and an iterative variant for higher accuracy. Extensive experiments show that LAS2 achieves state-of-the-art accuracy among efficient stereo methods while maintaining significantly lower latency. Specifically, LAS2-H achieves stronger overall zero-shot performance than the iterative method Fast-FoundationStereo, with 1.8x and 2.7x faster inference on H200 and Orin, respectively. The project page, demos, and code are available at https://tomtomtommi.github.io/LiteAnyStereoV2/.
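The abstract names pseudo-label filtering and error clamping but does not say how either is defined. As a rough illustration only, the sketch below shows one plausible reading: pixels whose teacher confidence falls below a threshold are dropped, and the per-pixel L1 error on the remaining pixels is capped so that a few badly wrong pseudo labels cannot dominate the loss. The function name `clamped_l1_loss`, the confidence input, and the `min_confidence` and `max_error` parameters are all assumptions made for this example, not details taken from the paper.

```python
from typing import Optional, Sequence


def clamped_l1_loss(
    pred: Sequence[float],
    pseudo: Sequence[float],
    confidence: Sequence[float],
    min_confidence: float = 0.5,  # hypothetical filtering threshold
    max_error: float = 5.0,       # hypothetical clamp, in pixels of disparity
) -> Optional[float]:
    """Mean clamped L1 error over pixels that pass the confidence filter.

    Returns None when every pixel is filtered out.
    """
    total = 0.0
    kept = 0
    for p, t, c in zip(pred, pseudo, confidence):
        if c < min_confidence:
            continue  # filtering: skip unreliable pseudo labels
        total += min(abs(p - t), max_error)  # clamping: cap each pixel's error
        kept += 1
    return total / kept if kept else None


# Toy disparities: the third pixel is filtered out, the second and fourth are clamped.
pred = [10.0, 20.0, 30.0, 40.0]
pseudo = [11.0, 50.0, 30.0, 0.0]
confidence = [0.9, 0.8, 0.2, 0.9]
loss = clamped_l1_loss(pred, pseudo, confidence)
print(f"{loss:.3f}")  # 3.667
```

In this toy case the raw errors of 30 and 40 pixels each contribute only 5, which is the intended effect of a clamp: outlier pseudo labels still produce a training signal, but a bounded one.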
Similar Articles
AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification
Introduces AnySimLite, a lightweight similarity encoder for on-device speech-adjacent classification tasks, achieving state-of-the-art or competitive performance while using less than 1/250th the model size of the qLLaMA-LoRA-7B baseline.
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
LiteFrame proposes a lightweight video encoder with Compressed Token Distillation training that reduces latency and enables processing 8x more frames for long-form video understanding in Video LLMs, improving accuracy while reducing compute.
LiteFrame Scales Video LLM Efficiency (6 minute read)
LiteFrame introduces a highly efficient video encoder for Video LLMs that uses Compressed Token Distillation to enable up to 8x more frames and 35% latency reduction while maintaining accuracy, setting a new Pareto frontier for long-form video understanding.
αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion
αDepth introduces a layered representation with Circular Alpha Representation (CAR) to address soft boundary challenges in stereo conversion, achieving state-of-the-art performance without manual guidance.