LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Hugging Face Daily Papers

Summary

LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across text, image, and video conditioning modalities, assessing quality, consistency, and alignment over extended temporal sequences.

Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5-10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy-guided benchmark construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.
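As a rough illustration of what a cross-segment consistency score could look like, the sketch below compares per-segment embedding vectors with cosine similarity. It is not the benchmark's actual metric: the function names (`cosine`, `adjacent_consistency`, `drift_from_first`) are hypothetical, and the embeddings are assumed to have been computed beforehand by some encoder (for example a DINO-v2 or ArcFace model), one vector per generated segment.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_consistency(embs: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between consecutive segment embeddings.

    Hypothetical score: assumes one precomputed embedding per segment.
    """
    if len(embs) < 2:
        raise ValueError("need at least two segments")
    sims = [cosine(embs[i], embs[i + 1]) for i in range(len(embs) - 1)]
    return sum(sims) / len(sims)


def drift_from_first(embs: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of every segment to the first one, to expose long-range drift."""
    if not embs:
        raise ValueError("need at least one segment")
    return [cosine(embs[0], e) for e in embs]


if __name__ == "__main__":
    # Toy 2-D embeddings: the third segment has drifted away from the first two.
    segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(adjacent_consistency(segments))
    print(drift_from_first(segments))
```

Tracking similarity to the first segment as well as to the previous one is one way to separate abrupt breaks between neighbouring segments from slow drift that only becomes visible over a minute-long horizon.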

Cached at: 05/27/26, 02:47 AM


Source: https://huggingface.co/papers/2605.26244
Published on May 25

#3 Paper of the day

Similar Articles

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Hugging Face Daily Papers

WebCompass is a multimodal benchmark for evaluating LLMs on web coding tasks across three input modalities (text, image, video) and three task types (generation, editing, repair). It introduces an Agent-as-a-Judge paradigm that autonomously executes generated websites in a real browser to assess visual fidelity and interactivity.

Benchmarking Speech-to-Speech Translation Models

arXiv cs.CL

COMPASS is a unified benchmarking framework for speech-to-speech translation (S2ST) that integrates 46 metrics across eight dimensions, evaluated on 1,248 model-language configurations. It identifies complementary architecture strengths and proposes reduced metric subsets that preserve rankings while cutting evaluation time.

When Vision Speaks for Sound

Hugging Face Daily Papers

This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.

Native Audio-Visual Alignment for Generation

Hugging Face Daily Papers

NAVA proposes a native audio-visual alignment framework for joint audio-video generation using an Align-then-Fuse MMDiT architecture, achieving improved synchronization and controllability with 6.3B parameters.