Native Audio-Visual Alignment for Generation

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

audio-visual generation synchronization multi-modal diffusion alignment

Summary

NAVA proposes a native audio-visual alignment framework for joint audio-video generation using an Align-then-Fuse MMDiT architecture, achieving improved synchronization and controllability with 6.3B parameters.

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.

Original Article

Cached at: 05/29/26, 03:00 AM

Source: https://huggingface.co/papers/2605.30073

Abstract

NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.
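
The align-then-fuse ordering described in the abstract can be illustrated with a toy sketch. This is not NAVA's implementation: the block internals, the single-head attention, the layer counts, and every name below (`align_block`, `fuse_block`, `align_then_fuse`, `init`) are assumptions made for illustration only. The sketch shows just the ordering the abstract states: audio and video tokens first interact in a stage with modality-specific weights and no external context, and only afterwards pass through shared-weight blocks that are conditioned on context.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def init(rng, names, d):
    # Hypothetical random weights: one (d, d) matrix per name.
    return {name: rng.normal(scale=d ** -0.5, size=(d, d)) for name in names}

def align_block(video, audio, p):
    # Assumed "modality-aware" stage: each modality has its own projections,
    # but attention runs over the joint audio-video sequence.
    # No external context enters here.
    qv, kv, vv = (video @ p["video"][name] for name in "qkv")
    qa, ka, va = (audio @ p["audio"][name] for name in "qkv")
    out = attention(np.concatenate([qv, qa]),
                    np.concatenate([kv, ka]),
                    np.concatenate([vv, va]))
    n_video = len(video)
    return video + out[:n_video], audio + out[n_video:]

def fuse_block(tokens, context, p):
    # Assumed "modality-shared" stage: one set of weights for all
    # audio-video tokens, followed by cross-attention to the context.
    tokens = tokens + attention(tokens @ p["q"], tokens @ p["k"], tokens @ p["v"])
    tokens = tokens + attention(tokens @ p["cq"], context @ p["ck"], context @ p["cv"])
    return tokens

def align_then_fuse(video, audio, context, align_params, fuse_params):
    # Stage 1: establish audio-video correspondence.
    for p in align_params:
        video, audio = align_block(video, audio, p)
    # Stage 2: joint processing of the fused sequence, conditioned on context.
    n_video = len(video)
    tokens = np.concatenate([video, audio])
    for p in fuse_params:
        tokens = fuse_block(tokens, context, p)
    return tokens[:n_video], tokens[n_video:]

rng = np.random.default_rng(0)
d = 16
video = rng.normal(size=(8, d))     # 8 toy video tokens
audio = rng.normal(size=(12, d))    # 12 toy audio tokens
context = rng.normal(size=(5, d))   # 5 toy context (e.g. text) tokens
align_params = [{"video": init(rng, "qkv", d), "audio": init(rng, "qkv", d)}
                for _ in range(2)]
fuse_params = [init(rng, ["q", "k", "v", "cq", "ck", "cv"], d) for _ in range(2)]

video_out, audio_out = align_then_fuse(video, audio, context,
                                       align_params, fuse_params)
print(video_out.shape, audio_out.shape)
```

In this toy version the context tensor is never passed to `align_block`, which mirrors the abstract's point that semantic conditioning is kept separate from low-level audio-video synchronization. How the actual model conditions on context and where it switches between stages is not specified in the abstract.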

Models citing this paper: 2

robingg1/NAVA • Text-to-Video • Updated about 2 hours ago • 23 • 5

ernie-research/NAVA • Text-to-Video • Updated 35 minutes ago • 2

Similar Articles

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Geo-Align: Video Generation Alignment via Metric Geometry Reward

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

When Vision Speaks for Sound
