UniT: Unified Geometry Learning with Group Autoregressive Transformer
Summary
UniT is a unified feed-forward model for geometry perception using a Group Autoregressive Transformer that integrates multiple paradigms (online/offline, multi-modal, long-horizon) while maintaining metric-scale accuracy via scale-adaptive loss and queue-style KV caching. It achieves state-of-the-art performance on ten benchmarks spanning seven tasks.
Source: https://huggingface.co/papers/2605.21131
Abstract
UniT presents a unified feed-forward model for geometry perception using a Group Autoregressive Transformer that integrates multiple paradigms while maintaining metric-scale accuracy through scale-adaptive loss and queue-style KV caching.
Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
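The group-size and bounded-memory ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `make_groups`, `QueueCache`, and `run` are hypothetical names, frames are plain strings standing in for per-layer key/value tensors, and first-in-first-out eviction by whole group is an assumption about what "queue-style" means. The sketch only shows how one loop covers both modes (group size 1 for online, group size equal to the sequence length for offline) while the cached context stays bounded.

```python
from collections import deque
from typing import Deque, List, Sequence


def make_groups(frames: Sequence[str], group_size: int) -> List[List[str]]:
    """Split a frame sequence into consecutive groups (the autoregressive units)."""
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return [list(frames[i:i + group_size]) for i in range(0, len(frames), group_size)]


class QueueCache:
    """Toy FIFO cache holding at most `max_groups` recent groups (assumed policy)."""

    def __init__(self, max_groups: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.entries: Deque[List[str]] = deque(maxlen=max_groups)

    def push(self, group: List[str]) -> None:
        self.entries.append(group)

    def context(self) -> List[str]:
        """Flatten the cached groups into the frames still visible as memory."""
        return [frame for group in self.entries for frame in group]


def run(frames: Sequence[str], group_size: int, max_groups: int) -> List[List[str]]:
    """Return, for each autoregressive step, the frames visible at that step."""
    cache = QueueCache(max_groups)
    steps = []
    for group in make_groups(frames, group_size):
        steps.append(cache.context() + group)  # cached memory plus current group
        cache.push(group)                      # oldest group is evicted if full
    return steps


frames = ["f0", "f1", "f2", "f3"]
online = run(frames, group_size=1, max_groups=2)             # four single-frame steps
offline = run(frames, group_size=len(frames), max_groups=2)  # one multi-frame step
```

In the online run, the last step sees only `f1`, `f2`, and `f3`: `f0` has already been evicted, so the context never exceeds two cached groups plus the current one. In the offline run, all four frames are processed in a single step and the cache is never consulted.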
Similar Articles
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
UniCorn is a framework that enables unified multimodal models to self-improve by using a multi-agent system for prompt generation, image creation, and quality evaluation, achieving state-of-the-art results on text-to-image benchmarks like TIIF, WISE, and OneIG-EN.
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
EVA01 is a unified framework that integrates 3D mesh as a native modality into multimodal language models via a Mixture-of-Transformers architecture, enabling state-of-the-art text-to-3D generation and long-context multi-turn geometric editing.
Towards Consistent Video Geometry Estimation
ViGeo is a transformer-based foundation model that recovers dense and consistent 3D geometry from videos using dynamic chunking attention and a completion-based data refinement framework, achieving state-of-the-art performance across multiple tasks.
RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases
This paper introduces RelGT-AC, a relational graph transformer architecture tailored for autocomplete tasks in relational databases. The model extends the RelGT architecture with column masking to prevent trivial solutions, a unified task head for multiple prediction types, and a TF-IDF text encoder to leverage lexical signals, achieving significant improvements over baselines on RelBench v2 benchmarks.
Geometric Context Transformer for Streaming 3D Reconstruction
Introduces LingBot-Map, a feed-forward 3D foundation model for streaming 3D reconstruction using a geometric context transformer architecture that achieves stable real-time performance at 20 FPS.