Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Summary
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency by up to 2.0x and memory usage by up to 2.4x while maintaining geometric accuracy on backbones such as VGGT and DA3-Large.
Source: https://huggingface.co/papers/2605.11354
Abstract
Lite3R addresses efficiency challenges in transformer-based 3D reconstruction through sparse attention and low-precision quantization while maintaining geometric accuracy.
Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: https://github.com/AIGeeksGroup/Lite3R. Website: https://aigeeksgroup.github.io/Lite3R.
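The abstract names Sparse Linear Attention but does not spell out its design here. As a rough illustration, the PyTorch sketch below shows one common way to combine a sparse and a linear branch: an exact top-k branch preserves the highest-scoring query-key interactions (the "important geometric interactions"), while a kernelized linear branch supplies O(N) global context. All function and parameter names are illustrative assumptions, not Lite3R's actual implementation.

```python
# Minimal sketch (not the authors' code) of a sparse + linear attention
# combination. For clarity this toy version materializes the full score
# matrix to find the top-k keys; a real implementation would use a
# block-sparse kernel and never build the dense N x N scores.
import torch
import torch.nn.functional as F


def sparse_linear_attention(q, k, v, top_k=64):
    """q, k, v: (batch, heads, seq, dim) -> (batch, heads, seq, dim)."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale

    # Sparse branch: exact softmax attention over the top-k keys per query.
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)
    sparse_probs = topk_scores.softmax(dim=-1)
    topk_v = torch.gather(
        v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1),  # (B,H,Q,K,D)
        3,
        topk_idx.unsqueeze(-1).expand(-1, -1, -1, -1, v.shape[-1]),
    )
    sparse_out = torch.einsum("bhqk,bhqkd->bhqd", sparse_probs, topk_v)

    # Linear branch: kernel feature map (elu + 1) gives O(N) global mixing.
    q_f, k_f = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bhkd,bhke->bhde", k_f, v)
    z = 1.0 / (torch.einsum("bhqd,bhd->bhq", q_f, k_f.sum(dim=2)) + 1e-6)
    linear_out = torch.einsum("bhqd,bhde,bhq->bhqe", q_f, kv, z)

    return sparse_out + linear_out
```

The latency savings come from never forming the dense score matrix in practice: the sparse branch touches only top_k keys per query and the linear branch runs in O(N), so attention cost stops scaling quadratically with the number of multi-view tokens.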
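Likewise, the FP8-aware QAT recipe is only named in the abstract. The sketch below shows the general pattern under stated assumptions: fake-quantize weights and activations through the float8 e4m3 range with a straight-through estimator, freeze the pretrained backbone except the lightweight projection layers, and distill attention maps from the full-precision teacher. It assumes a recent PyTorch that provides torch.float8_e4m3fn; module names such as `linear_proj` and the model-returns-attention API are hypothetical.

```python
# Minimal sketch (not the authors' recipe) of FP8-aware QAT with a frozen
# backbone and partial attention distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

E4M3_MAX = 448.0  # largest finite magnitude representable in float8 e4m3


def fake_quant_fp8(x: torch.Tensor) -> torch.Tensor:
    """Round-trip a tensor through float8 e4m3 with per-tensor scaling;
    the straight-through estimator keeps gradients flowing."""
    scale = x.detach().abs().amax().clamp(min=1e-12) / E4M3_MAX
    x_q = (x / scale).to(torch.float8_e4m3fn).to(x.dtype) * scale
    return x + (x_q - x).detach()  # forward: quantized; backward: identity


class FP8Linear(nn.Linear):
    """Projection layer that sees FP8 rounding on weights and inputs during
    training, so the network adapts to low-precision execution."""
    def forward(self, x):
        return F.linear(fake_quant_fp8(x), fake_quant_fp8(self.weight), self.bias)


def freeze_backbone(student: nn.Module, trainable_tag: str = "linear_proj"):
    """Freeze all pretrained parameters except the lightweight linear-branch
    projections (the name filter is an assumption)."""
    for name, p in student.named_parameters():
        p.requires_grad_(trainable_tag in name)


def distill_step(student, teacher, batch, optimizer, alpha=0.5):
    """One step of task loss plus partial attention distillation
    (assumes both models return their attention maps)."""
    task_loss, student_attn = student(batch)
    with torch.no_grad():
        _, teacher_attn = teacher(batch)
    loss = task_loss + alpha * F.mse_loss(student_attn, teacher_attn)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the projection layers receive gradients, the student adapts to FP8 rounding without disturbing the pretrained geometric priors, which is the stability argument the abstract makes for parameter-efficient QAT.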
Get this paper in your agent:
hf papers read 2605.11354
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Geometric Context Transformer for Streaming 3D Reconstruction
Introduces LingBot-Map (robbyant/lingbot-map), a feed-forward 3D foundation model for streaming 3D reconstruction that uses a Geometric Context Transformer architecture, achieving state-of-the-art performance with stable real-time inference at ~20 FPS on long sequences exceeding 10,000 frames.
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon proposes a scalable framework for 3D reconstruction from arbitrary sparse inputs using a video diffusion model with persistent scene memory and geometry-aware conditioning.
Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures
Aletheia introduces a gradient-guided layer selection method for efficient LoRA fine-tuning that identifies task-relevant transformer layers via lightweight gradient probes and applies adapters selectively, achieving 15-28% training speedup across 14 models while maintaining downstream performance on MMLU, GSM8K, and HumanEval benchmarks.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
Proposes Memory-Efficient Looped Transformer (MELT), a novel recurrent LLM architecture that decouples reasoning depth from memory consumption by sharing a single KV cache across loops and using chunk-wise training with interpolated transition and attention-aligned distillation.