Lance: Unified Multimodal Modeling by Multi-Task Synergy

Hugging Face Daily Papers

Summary

Lance is a unified multimodal model that leverages a dual-stream mixture-of-experts architecture and collaborative multi-task training to achieve strong performance in understanding, generation, and editing of both images and videos, outperforming existing open-source unified models.

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.
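
The idea of a shared context with decoupled pathways can be illustrated with a toy sketch. This is not Lance's implementation: the `Token` class, the stream labels, the mean-based `shared_context` stand-in for attention, and the two arithmetic "experts" are all hypothetical, chosen only to show every token reading the same joint context before being routed to a separate understanding or generation pathway.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Token:
    stream: str            # hypothetical label: "understanding" or "generation"
    features: List[float]  # toy feature vector


def shared_context(tokens: List[Token]) -> List[float]:
    """Toy stand-in for shared attention: the mean of all token features."""
    dim = len(tokens[0].features)
    return [sum(t.features[i] for t in tokens) / len(tokens) for i in range(dim)]


def understanding_expert(x: List[float]) -> List[float]:
    """Placeholder expert for the understanding pathway (assumption)."""
    return [2.0 * v for v in x]


def generation_expert(x: List[float]) -> List[float]:
    """Placeholder expert for the generation pathway (assumption)."""
    return [v + 1.0 for v in x]


EXPERTS: Dict[str, Callable[[List[float]], List[float]]] = {
    "understanding": understanding_expert,
    "generation": generation_expert,
}


def dual_stream_layer(tokens: List[Token]) -> List[List[float]]:
    """Mix each token with the joint context, then route it by its stream."""
    ctx = shared_context(tokens)  # computed once over the whole interleaved sequence
    outputs = []
    for t in tokens:
        mixed = [f + c for f, c in zip(t.features, ctx)]
        outputs.append(EXPERTS[t.stream](mixed))
    return outputs


tokens = [Token("understanding", [1.0, 0.0]), Token("generation", [3.0, 2.0])]
print(dual_stream_layer(tokens))
```

In this sketch the routing is a hard lookup on a per-token label; how the actual model assigns tokens to its two streams is not described in the abstract.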

Cached at: 05/19/26, 06:30 AM

Source: https://huggingface.co/papers/2605.18678 Published on May 18

#2 Paper of the day

Abstract

Lance is a unified multimodal model that combines understanding, generation, and editing capabilities for images and videos through collaborative multi-task training and a dual-stream architecture.

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.
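
To make the positional-encoding idea concrete, here is a minimal sketch of standard rotary positional encoding with one possible modality-aware twist. The abstract does not say how Lance's variant works, so the per-modality position offsets in `MODALITY_OFFSET` and the function `modality_aware_rope` are assumptions for illustration only; `rope_rotate` is the usual RoPE rotation of feature pairs.

```python
import math
from typing import List

# Hypothetical per-modality position offsets (an assumption, not from the paper).
# They place text, image, and video tokens in separate position ranges.
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}


def rope_rotate(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Apply standard rotary positional encoding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair is rotated with its own frequency base^(-i/dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def modality_aware_rope(vec: List[float], index: int, modality: str) -> List[float]:
    """Rotate by the token index shifted into its modality's range (assumed scheme)."""
    return rope_rotate(vec, index + MODALITY_OFFSET[modality])


q = [1.0, 2.0, 3.0, 4.0]
print(modality_aware_rope(q, 3, "text"))
print(modality_aware_rope(q, 3, "image"))
```

Because rotary encodings make attention scores depend only on relative position, tokens within one modality keep their usual relative geometry under this scheme, while tokens from different modalities sit far apart in position space. Whether Lance uses offsets, separate frequency bands, or another mechanism is not stated in the abstract.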

Get this paper in your agent:

hf papers read 2605.18678

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### bytedance-research/Lance Any-to-Any • Updated about 3 hours ago • 94

Similar Articles

bytedance-research/Lance

Hugging Face Models Trending

ByteDance Research introduces Lance, a 3B-parameter unified multimodal model trained from scratch on 128 A100 GPUs, capable of image and video understanding, generation, and editing within a single framework.