Native Audio-Visual Alignment for Generation

Hugging Face Daily Papers

Summary

NAVA proposes a native audio-visual alignment framework for joint audio-video generation using an Align-then-Fuse MMDiT architecture, achieving improved synchronization and controllability with 6.3B parameters.

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.
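The two-stage ordering described above can be illustrated with a toy NumPy sketch. It is not NAVA's implementation: the abstract does not specify layer internals, so everything here is an assumption made for illustration, including the function names (`align_block`, `fuse_block`, `align_then_fuse`), the use of cross-attention to inject context, the random weights, and the omission of normalization, timestep conditioning, and MLPs. It only shows the idea that audio and video tokens first attend to each other with modality-specific projections and no context, and are then processed as one sequence with shared weights while being conditioned on context tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for 2-D inputs (tokens x dim).
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def init_proj(rng, d):
    # Hypothetical random Q/K/V projections standing in for learned weights.
    return {n: rng.normal(scale=d ** -0.5, size=(d, d)) for n in ("q", "k", "v")}

def align_block(video, audio, w_video, w_audio):
    # Modality-aware stage (assumed form): each modality has its own
    # projections, but attention runs over the joint audio+video sequence.
    # No text/context tokens take part here.
    q = np.concatenate([video @ w_video["q"], audio @ w_audio["q"]])
    k = np.concatenate([video @ w_video["k"], audio @ w_audio["k"]])
    v = np.concatenate([video @ w_video["v"], audio @ w_audio["v"]])
    out = attention(q, k, v)
    n_video = video.shape[0]
    return video + out[:n_video], audio + out[n_video:]

def fuse_block(joint, context, w_self, w_cross):
    # Modality-shared stage (assumed form): one set of weights for all
    # audio-video tokens, then cross-attention to the external context.
    joint = joint + attention(joint @ w_self["q"], joint @ w_self["k"], joint @ w_self["v"])
    joint = joint + attention(joint @ w_cross["q"], context @ w_cross["k"], context @ w_cross["v"])
    return joint

def align_then_fuse(video, audio, context, n_align=2, n_fuse=2, seed=0):
    rng = np.random.default_rng(seed)
    d = video.shape[1]
    for _ in range(n_align):
        video, audio = align_block(video, audio, init_proj(rng, d), init_proj(rng, d))
    joint = np.concatenate([video, audio])
    for _ in range(n_fuse):
        joint = fuse_block(joint, context, init_proj(rng, d), init_proj(rng, d))
    n_video = video.shape[0]
    return joint[:n_video], joint[n_video:]

# Toy inputs: 6 video tokens, 10 audio tokens, 4 context tokens, width 16.
rng = np.random.default_rng(1)
video = rng.normal(size=(6, 16))
audio = rng.normal(size=(10, 16))
context = rng.normal(size=(4, 16))
v_out, a_out = align_then_fuse(video, audio, context)
print(v_out.shape, a_out.shape)
```

In this sketch the context only enters in the second stage, which mirrors the abstract's separation of low-level audio-video synchronization from semantic conditioning; how the actual model realizes that separation is described in the paper, not here.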

Cached at: 05/29/26, 03:00 AM

Source: https://huggingface.co/papers/2605.30073

Abstract

NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.

Models citing this paper: 2

robingg1/NAVA (Text-to-Video)
ernie-research/NAVA (Text-to-Video)

Similar Articles

When Vision Speaks for Sound

Hugging Face Daily Papers

This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.