Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
Summary
UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing.
Cached at: 06/17/26, 03:36 AM
Source: https://huggingface.co/papers/2606.18249
Abstract
UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and parallel prediction.
Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
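To make the two tokenizer ideas in the abstract more concrete, here is a minimal sketch in plain Python. It is not UniAR's implementation. It assumes sign-based binarization as the lookup-free quantizer (one common scheme, which may differ from what the paper uses), and it assumes a 2x2 spatial grouping; the function names, the 3-channel features, and the 4x4 grid are all hypothetical and chosen only for illustration.

```python
from typing import List


def bitwise_quantize(features: List[float]) -> List[int]:
    """Assumed lookup-free scheme: each channel becomes one bit from its sign.

    No codebook is stored or searched; d channels give 2**d possible codes.
    """
    return [1 if x > 0 else 0 for x in features]


def bits_to_index(bits: List[int]) -> int:
    """Pack the bits (most significant first) into the implicit token id."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def group_bits(grid: List[List[List[int]]], block: int = 2) -> List[List[int]]:
    """Concatenate the bits of every block x block neighborhood into one step.

    Each returned step is what a parallel bitwise head would predict jointly,
    so the sequence length shrinks by a factor of block * block.
    """
    rows, cols = len(grid), len(grid[0])
    steps = []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            step: List[int] = []
            for dr in range(block):
                for dc in range(block):
                    step.extend(grid[r + dr][c + dc])
            steps.append(step)
    return steps


# Hypothetical toy input: a 4x4 grid of patches with 3 feature channels each.
CHANNELS = 3
feature_grid = [
    [[float((r * 4 + c + k) % 3 - 1) for k in range(CHANNELS)] for c in range(4)]
    for r in range(4)
]
bit_grid = [[bitwise_quantize(patch) for patch in row] for row in feature_grid]
steps = group_bits(bit_grid, block=2)

print("codes per patch:", 2 ** CHANNELS)
print("sequence length without grouping:", 16, "with grouping:", len(steps))
print("bits predicted per step:", len(steps[0]))
```

In this toy setup the vocabulary grows exponentially with the number of bit channels while the per-step output stays a list of independent binary decisions, which is the sense in which the effective vocabulary scales at little cost. The grouping step shows how jointly predicting the bits of neighboring patches shortens the autoregressive sequence.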
Models citing this paper: 2
ShareLab-SII/UniAR-RL (Image-to-Text, 10B, updated about 2 hours ago)
ShareLab-SII/UniAR-SFT (Image-to-Text, 10B, updated about 2 hours ago)
Similar Articles
ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations
ARM presents a unified autoregressive framework for image understanding, generation, and editing using discrete semantic tokenization and reinforcement learning optimization, showing cross-task synergy.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
UniCorn is a framework that enables unified multimodal models to self-improve by using a multi-agent system for prompt generation, image creation, and quality evaluation, achieving state-of-the-art results on text-to-image benchmarks like TIIF, WISE, and OneIG-EN.
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
The paper introduces JoyAI-Image, a unified multimodal foundation model that integrates a spatially enhanced MLLM with MMDiT to achieve state-of-the-art performance in visual understanding, text-to-image generation, and instruction-guided editing.
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, achieving strong performance across understanding and generation tasks.