This paper introduces Audio-Interaction, a unified streaming audio model that combines offline task execution with real-time audio instruction following in an end-to-end framework. It proposes SoundFlow for the perceive-decide-respond loop and reports competitive performance across benchmarks.
Introduces Representation Forcing (RF), a technique that enables unified multimodal models to perform both perception and generation end-to-end without external VAE latent spaces, matching state-of-the-art VAE-based models in image generation while improving understanding.
Lumos-Nexus is a training-efficient video generation framework that uses a two-stage design with a lightweight generator for training and a high-capacity pretrained generator for inference, achieving enhanced visual fidelity through Unified Progressive Frequency Bridging.
Presents a unified neural scaling law that accurately models deep neural network scaling across multiple dimensions including parameters, dataset size, training steps, and compute, validated across diverse architectures and tasks.
UniT is a unified feed-forward model for geometry perception using a Group Autoregressive Transformer that integrates multiple paradigms (online/offline, multi-modal, long-horizon) while maintaining metric-scale accuracy via scale-adaptive loss and queue-style KV caching. It achieves state-of-the-art performance on ten benchmarks spanning seven tasks.
Uni-Edit proposes using intelligent image editing as a single general task to simultaneously improve unified multimodal models' understanding, generation, and editing capabilities, with an automated data synthesis pipeline creating complex editing instructions.
Lance is a unified multimodal model that leverages a dual-stream mixture-of-experts architecture and collaborative multi-task training to achieve strong performance in understanding, generation, and editing of both images and videos, outperforming existing open-source unified models.
ByteDance Research introduces Lance, a 3B-parameter unified multimodal model trained from scratch on 128 A100 GPUs, capable of image and video understanding, generation, and editing within a single framework.
Tuna-2 is a unified multimodal model that achieves state-of-the-art performance by processing visual understanding and generation directly from pixel embeddings, eliminating the need for pretrained vision encoders.
LLaDA2.0-Uni unifies multimodal understanding and generation within a single diffusion-based large language model architecture.
UniCorn is a framework that enables unified multimodal models to self-improve by using a multi-agent system for prompt generation, image creation, and quality evaluation, achieving state-of-the-art results on text-to-image benchmarks like TIIF, WISE, and OneIG-EN.