Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Hugging Face Daily Papers 06/16/26, 12:00 AM Papers

multimodal autoregressive visual-tokenizer image-generation image-editing unified-modeling state-of-the-art

Summary

UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing.

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
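The abstract does not spell out how the lookup-free bitwise quantization scheme works. The sketch below shows one common form of the general idea, assuming each feature channel is binarized by its sign so that a token id can be read directly from the bits without a nearest-neighbor codebook search. The function names, the sign rule, and the 4-channel example are illustrative assumptions, not UniAR's actual implementation.

```python
# Illustrative sketch only: sign-based binarization is an assumption here,
# not the quantizer described in the UniAR paper.

def bitwise_quantize(features):
    """Map each channel to one bit by its sign (no codebook lookup)."""
    return [1 if x > 0 else 0 for x in features]


def bits_to_index(bits):
    """Pack a bit vector into one integer token id, most significant bit first."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def index_to_bits(index, num_bits):
    """Unpack an integer token id back into its bit vector."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


# Hypothetical 4-channel feature vector for a single visual position.
features = [0.7, -1.2, 0.1, -0.3]
bits = bitwise_quantize(features)
token_id = bits_to_index(bits)

# With d binary channels the implicit vocabulary has 2**d entries, so each
# added channel doubles the vocabulary without storing any codebook vectors.
effective_vocab = 2 ** len(features)
```

Under this assumption, the vocabulary grows exponentially with the number of bits while the quantizer itself stays a per-channel threshold, which is one way to read the abstract's claim of scaling the visual vocabulary at minimal cost.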

Original Article

Cached at: 06/17/26, 03:36 AM

Paper page - Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Source: https://huggingface.co/papers/2606.18249

Abstract

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
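To make the sequence-length claim concrete, here is a minimal sketch of spatial grouping, assuming a square grid of per-position bit codes and a fixed group size. It merges each block of neighboring positions into a single autoregressive step whose bits would all be predicted jointly. The helper `group_codes`, the 4x4 grid, the 2 bits per position, and the 2x2 group size are hypothetical choices for illustration; the abstract does not state the actual grouping or the number of levels.

```python
# Illustrative sketch only: grid size, bits per position, and group size are
# assumptions, not values taken from the UniAR paper.

def group_codes(grid, group):
    """Merge each group x group block of per-position bit codes into one step."""
    rows, cols = len(grid), len(grid[0])
    if rows % group or cols % group:
        raise ValueError("grid dimensions must be divisible by the group size")
    steps = []
    for r in range(0, rows, group):
        for c in range(0, cols, group):
            bits = []
            for dr in range(group):
                for dc in range(group):
                    bits.extend(grid[r + dr][c + dc])
            steps.append(bits)  # all bits in this step are predicted together
    return steps


# Hypothetical 4x4 grid with 2 bits per position.
grid = [[[(r + c) % 2, r % 2] for c in range(4)] for r in range(4)]
steps = group_codes(grid, group=2)

positions = len(grid) * len(grid[0])   # sequence length without grouping
sequence_length = len(steps)           # sequence length with 2x2 grouping
bits_per_step = len(steps[0])          # bits predicted in parallel per step
```

In this toy setting the sequence shrinks by a factor of `group * group`, while each step carries proportionally more bits. That trade-off is the general mechanism by which joint prediction of grouped codes can shorten the visual sequence.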

Models citing this paper: 2

ShareLab-SII/UniAR-RL (Image-to-Text, 10B)
ShareLab-SII/UniAR-SFT (Image-to-Text, 10B)

Similar Articles

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
