UniT: Unified Geometry Learning with Group Autoregressive Transformer

Hugging Face Daily Papers 05/20/26, 12:00 AM Papers

Summary

UniT is a unified feed-forward model for geometry perception using a Group Autoregressive Transformer that integrates multiple paradigms (online/offline, multi-modal, long-horizon) while maintaining metric-scale accuracy via scale-adaptive loss and queue-style KV caching. It achieves state-of-the-art performance on ten benchmarks spanning seven tasks.

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, the essential capabilities of geometry perception remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
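The group autoregression and queue-style caching described above can be illustrated with a toy sketch. This is not the authors' implementation: frames are stand-in strings rather than tensors, and the names make_groups, run_group_autoregression, and max_cached_groups are hypothetical. It only shows how one loop could cover both modes by changing the group size, and how a first-in-first-out queue keeps the memory bounded.

```python
from collections import deque
from typing import Deque, Iterator, List, Sequence, Tuple


def make_groups(frames: Sequence[str], group_size: int) -> Iterator[List[str]]:
    """Split a frame sequence into consecutive groups of at most group_size frames."""
    if group_size < 1:
        raise ValueError("group_size must be positive")
    for start in range(0, len(frames), group_size):
        yield list(frames[start:start + group_size])


def run_group_autoregression(
    frames: Sequence[str], group_size: int, max_cached_groups: int
) -> List[Tuple[List[str], List[str]]]:
    """Process groups in order and record what each step could attend to.

    Returns one (current_group, cached_context) pair per autoregressive step.
    The cache is a FIFO queue (an assumption standing in for a KV cache):
    once it holds max_cached_groups groups, the oldest group is dropped.
    """
    cache: Deque[List[str]] = deque(maxlen=max_cached_groups)
    steps = []
    for group in make_groups(frames, group_size):
        # Flatten the cached groups into the context visible at this step.
        context = [frame for cached in cache for frame in cached]
        steps.append((group, context))
        # Appending to a full deque discards the oldest entry automatically.
        cache.append(group)
    return steps


frames = [f"f{i}" for i in range(6)]
# Online mode: single-frame groups, many autoregressive steps.
online = run_group_autoregression(frames, group_size=1, max_cached_groups=2)
# Offline mode: one multi-frame group, a single step with an empty cache.
offline = run_group_autoregression(frames, group_size=len(frames), max_cached_groups=2)
```

In this sketch the context never grows beyond max_cached_groups groups, regardless of sequence length. Discarding old entries is only reasonable if later predictions do not depend on early frames, which is the role the abstract assigns to anchor-free relational modeling.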

Original Article

Cached at: 05/21/26, 06:20 AM

Source: https://huggingface.co/papers/2605.21131

Abstract

UniT presents a unified feed-forward model for geometry perception using a Group Autoregressive Transformer that integrates multiple paradigms while maintaining metric-scale accuracy through scale-adaptive loss and queue-style KV caching.

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, the essential capabilities of geometry perception remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
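The abstract does not give the form of the scale-adaptive geometry loss, so the following is only one hypothetical way to couple a relative term with a down-weighted absolute scale term. Every detail is an assumption made for illustration: the mean-norm normalization, the log-ratio scale penalty, reading "partial" as a small weight, and the names scale_adaptive_loss and abs_weight. None of it is taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def mean_norm(points: Sequence[Point]) -> float:
    """Average Euclidean norm of a point set, used here as its global scale."""
    return sum(math.sqrt(x * x + y * y + z * z) for x, y, z in points) / len(points)


def scale_adaptive_loss(
    pred: Sequence[Point], gt: Sequence[Point], abs_weight: float = 0.1
) -> float:
    """Toy loss: scale-invariant point error plus a weighted scale penalty.

    Hypothetical formulation, not the paper's. Both point sets must be
    non-empty, equally long, and have non-zero mean norm.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    s_pred, s_gt = mean_norm(pred), mean_norm(gt)
    if s_pred == 0.0 or s_gt == 0.0:
        raise ValueError("point sets must have non-zero mean norm")
    # Relative term: compare shapes after each set is normalized by its own scale.
    relative = sum(
        math.dist([c / s_pred for c in p], [c / s_gt for c in g])
        for p, g in zip(pred, gt)
    ) / len(pred)
    # Absolute term: penalize the mismatch between the two global scales.
    absolute = abs(math.log(s_pred / s_gt))
    return relative + abs_weight * absolute
```

With abs_weight set to 0 this toy loss ignores global scale entirely, and raising the weight pulls predictions toward the metric scale of the ground truth. That mirrors, in a very loose way, the transition from scale-invariant geometry to metric-scale solutions that the abstract describes.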

Get this paper in your agent:

hf papers read 2605.21131

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Towards Consistent Video Geometry Estimation

RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

Geometric Context Transformer for Streaming 3D Reconstruction
