Lance: Unified Multimodal Modeling by Multi-Task Synergy
Summary
Lance is a unified multimodal model that leverages a dual-stream mixture-of-experts architecture and collaborative multi-task training to achieve strong performance in understanding, generation, and editing of both images and videos, outperforming existing open-source unified models.
Source: https://huggingface.co/papers/2605.18678 Published on May 18
#2 Paper of the day
Abstract
Lance is a unified multimodal model that combines understanding, generation, and editing capabilities for images and videos through collaborative multi-task training and a dual-stream architecture.
We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.
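The two principles in the abstract, unified context modeling and decoupled capability pathways, can be illustrated with a toy layer. The sketch below is not Lance's implementation: the pathway tags `understand` and `generate`, the `shared_context` mixing step, and the scaling "experts" are all hypothetical stand-ins chosen only to show the idea that every token sees the same shared sequence while its feed-forward step is chosen by pathway.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Token = Tuple[str, Vector]  # (pathway tag, hidden state); tags are hypothetical


def shared_context(states: List[Vector]) -> List[Vector]:
    """Toy stand-in for joint attention: each token mixes over ALL tokens."""
    out = []
    for q in states:
        scores = [sum(a * b for a, b in zip(q, k)) for k in states]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, states)) / total
            for d in range(len(q))
        ])
    return out


def make_expert(scale: float) -> Callable[[Vector], Vector]:
    """Hypothetical expert: fixed elementwise scaling, not a learned FFN."""
    return lambda h: [scale * x for x in h]


# One expert per capability pathway (an assumption made for illustration).
EXPERTS: Dict[str, Callable[[Vector], Vector]] = {
    "understand": make_expert(2.0),
    "generate": make_expert(-1.0),
}


def dual_stream_layer(tokens: List[Token]) -> List[Token]:
    """Shared context over the whole sequence, then a pathway-specific expert."""
    tags = [tag for tag, _ in tokens]
    mixed = shared_context([h for _, h in tokens])
    return [(tag, EXPERTS[tag](h)) for tag, h in zip(tags, mixed)]


if __name__ == "__main__":
    sequence = [("understand", [1.0, 0.0]), ("generate", [0.0, 1.0])]
    for tag, hidden in dual_stream_layer(sequence):
        print(tag, hidden)
```

In this toy, both tokens receive information from the full interleaved sequence, yet the understanding token and the generation token are transformed by different parameters. How Lance actually routes tokens, and how its experts are parameterized, is not specified in the abstract.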
Get this paper in your agent:
hf papers read 2605.18678
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
#### bytedance-research/Lance Any-to-Any • Updated about 3 hours ago • 94
Similar Articles
Show HN: Lance – image/video generation and understanding in one model
ByteDance releases Lance, a 3B parameter unified multimodal model supporting image and video generation, understanding, and editing, trained from scratch with a multi-task recipe.
bytedance-research/Lance
ByteDance Research introduces Lance, a 3B-parameter unified multimodal model trained from scratch on 128 A100 GPUs, capable of image and video understanding, generation, and editing within a single framework.
LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing
LoomVideo introduces a 5B-parameter unified architecture for video generation and editing that reduces computational overhead using novel conditioning mechanisms and multi-modal alignment, achieving competitive performance and faster inference.
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
The paper introduces JoyAI-Image, a unified multimodal foundation model that integrates a spatially enhanced MLLM with MMDiT to achieve state-of-the-art performance in visual understanding, text-to-image generation, and instruction-guided editing.
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
UniCorn is a framework that enables unified multimodal models to self-improve by using a multi-agent system for prompt generation, image creation, and quality evaluation, achieving state-of-the-art results on text-to-image benchmarks like TIIF, WISE, and OneIG-EN.