Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Summary

UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing.

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.

Cached at: 06/17/26, 03:36 AM

Source: https://huggingface.co/papers/2606.18249

Abstract

UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and parallel prediction.

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
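
The abstract does not spell out how the lookup-free bitwise quantization or the spatial grouping is implemented. As a rough illustration only, the sketch below shows the generic sign-based form of lookup-free quantization (each feature channel becomes one bit, so d channels give an implicit vocabulary of 2^d without a learned codebook) and one way neighbouring positions could be merged so that a model predicts all of their bits at a single step. The function names, the 16x16 grid, the 8 channels, and the 2x2 grouping are all assumptions chosen for the example, not values from the paper.

```python
import numpy as np

def lfq_quantize(features):
    """Sign-based lookup-free quantization: one bit per channel, no codebook."""
    bits = (features > 0).astype(np.int64)   # values in {0, 1}
    codes = 2.0 * bits - 1.0                 # values in {-1, +1}, fed to a decoder
    return bits, codes

def bits_to_index(bits):
    """Read the last axis as a binary number (least significant bit first)."""
    d = bits.shape[-1]
    weights = 1 << np.arange(d, dtype=np.int64)
    return (bits * weights).sum(axis=-1)

def group_tokens(bits, group=2):
    """Concatenate the bits of each group x group patch into one position."""
    h, w, d = bits.shape
    assert h % group == 0 and w % group == 0
    g = bits.reshape(h // group, group, w // group, group, d)
    g = g.transpose(0, 2, 1, 3, 4)           # (h/g, w/g, g, g, d)
    return g.reshape(h // group, w // group, group * group * d)

# Hypothetical encoder output: a 16x16 grid of 8-channel features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 16, 8))

bits, codes = lfq_quantize(feats)
indices = bits_to_index(bits)                # one integer token per position
grouped = group_tokens(bits, group=2)        # fewer positions, more bits each

print("positions before grouping:", bits.shape[0] * bits.shape[1])
print("positions after grouping: ", grouped.shape[0] * grouped.shape[1])
print("bits predicted per step:  ", grouped.shape[-1])
```

In this toy setting the sequence shrinks from 256 positions to 64, while each step emits 32 independent binary predictions instead of a single softmax over 2^32 entries. That is the general reason bitwise prediction keeps a very large effective vocabulary cheap; how UniAR combines multi-level codes within a group is not described in the abstract and is not modeled here.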

Models citing this paper: 2

ShareLab-SII/UniAR-RL (Image-to-Text, 10B)
ShareLab-SII/UniAR-SFT (Image-to-Text, 10B)
