SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Summary
This paper introduces SenseNova-U1, a unified multimodal architecture that integrates understanding and generation tasks. It is released in two variants, a dense 8B model and a 30B-A3B mixture-of-experts model, both of which perform competitively in perception and image synthesis.
Cached at: 05/13/26, 04:10 AM
Source: https://huggingface.co/papers/2605.12500 Published on May 12
#2 Paper of the day
Abstract
Unified vision-language models treat understanding and generation as integrated processes rather than separate tasks, demonstrating strong performance across multiple multimodal capabilities including image synthesis and action reasoning.
Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
Similar Articles
sensenova/SenseNova-U1-8B-MoT
SenseNova U1 is a new series of native multimodal models that unify understanding and generation within a single architecture using the NEO-Unify framework, eliminating the need for separate visual encoders or VAEs.
@heyshrutimishra: NEW: A model that thinks while it draws. SenseNova U1 is one model that handles understanding, reasoning, and generatio…
SenseNova U1 is a unified model that handles understanding, reasoning, and generation of text and images in the same architecture, enabling tasks like planning infographics end-to-end.
SenseNova U1 dropped an infographic-specific finetune
SenseNova U1 releases an infographic-specific finetune of its U1-8B-MoT base model, achieving significant benchmark improvements in infographic accuracy, chart understanding, and text rendering.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
NVIDIA releases Nemotron 3 Nano Omni, a 30B parameter multimodal model capable of processing video, audio, image, and text with integrated reasoning capabilities for enterprise workflows.