Native Audio-Visual Alignment for Generation
Summary
NAVA proposes a native audio-visual alignment framework for joint audio-video generation using an Align-then-Fuse MMDiT architecture, achieving improved synchronization and controllability with 6.3B parameters.
Cached at: 05/29/26, 03:00 AM
Source: https://huggingface.co/papers/2605.30073
Abstract
NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.
Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio, and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.
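The Align-then-Fuse idea described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the abstract gives no layer-level details, so everything below is an assumption made for illustration, including the function names (`aligned_block`, `fused_block`, `align_then_fuse`), the use of single-head attention with residual connections, random untrained weights, and the choice to inject context tokens only in the fuse stage as extra keys and values. It only shows the two-stage shape of the computation: per-modality projections with joint attention first, then one shared projection set conditioned on context.

```python
import math
import random

def matmul(x, w):
    # (n x d) times (d x m), with matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(q, k, v):
    # Plain single-head scaled dot-product attention.
    d = len(q[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k] for qi in q]
    return matmul([softmax(r) for r in scores], v)

def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.2) for _ in range(cols)] for _ in range(rows)]

def make_proj(rng, d):
    # Hypothetical q/k/v projection weights (random, untrained).
    return {name: rand_matrix(rng, d, d) for name in ("q", "k", "v")}

def add(x, y):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]

def aligned_block(audio, video, p_audio, p_video):
    # Assumed "modality-aware" stage: separate projections per modality,
    # but one joint attention over the concatenated audio and video tokens.
    n_a = len(audio)
    q = matmul(audio, p_audio["q"]) + matmul(video, p_video["q"])
    k = matmul(audio, p_audio["k"]) + matmul(video, p_video["k"])
    v = matmul(audio, p_audio["v"]) + matmul(video, p_video["v"])
    out = attention(q, k, v)
    return add(audio, out[:n_a]), add(video, out[n_a:])

def fused_block(audio, video, p_shared, context):
    # Assumed "modality-shared" stage: one projection set for all tokens;
    # context tokens are appended as extra keys/values to condition denoising.
    n_a = len(audio)
    tokens = audio + video
    q = matmul(tokens, p_shared["q"])
    k = matmul(tokens + context, p_shared["k"])
    v = matmul(tokens + context, p_shared["v"])
    out = attention(q, k, v)
    return add(audio, out[:n_a]), add(video, out[n_a:])

def align_then_fuse(audio, video, context, n_align=2, n_fuse=2, seed=0):
    rng = random.Random(seed)
    d = len(audio[0])
    for _ in range(n_align):
        audio, video = aligned_block(audio, video, make_proj(rng, d), make_proj(rng, d))
    for _ in range(n_fuse):
        audio, video = fused_block(audio, video, make_proj(rng, d), context)
    return audio, video
```

As a usage example, calling `align_then_fuse` with 3 audio tokens, 5 video tokens, and 2 context tokens of width 4 returns updated audio and video token lists of the same shapes; the context tokens influence the result only through the fuse-stage attention in this sketch.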
Get this paper in your agent:
hf papers read 2605.30073
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 2
robingg1/NAVA • Text-to-Video • Updated about 2 hours ago • 23 • 5
ernie-research/NAVA • Text-to-Video • Updated 35 minutes ago • 2
Similar Articles
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across text, image, and video conditioning modalities, assessing quality, consistency, and alignment over extended temporal sequences.
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
MSAVBench is the first comprehensive benchmark and adaptive evaluation framework for multi-shot audio-video generation, assessing 19 models across diverse tasks and achieving high alignment with human judgment.
Geo-Align: Video Generation Alignment via Metric Geometry Reward
Geo-Align presents a reinforcement learning framework for camera-controlled video re-rendering that improves generalization through scale-aware perceptual rewards and metric 3D estimation for camera trajectory extraction.
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
EVA01 is a unified framework that integrates 3D mesh as a native modality into multimodal language models via a Mixture-of-Transformers architecture, enabling state-of-the-art text-to-3D generation and long-context multi-turn geometric editing.
When Vision Speaks for Sound
This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.