LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
Summary
LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across text, image, and video conditioning modalities, assessing quality, consistency, and alignment over extended temporal sequences.
Source: https://huggingface.co/papers/2605.26244 Published on May 25
#3 Paper of the day
Abstract
LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences.
Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5-10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy-guided benchmark construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.
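The abstract does not say how cross-segment consistency is scored. As a rough illustration only, one common way to measure drift over a long clip is cosine similarity between per-segment embeddings (for example, features from an encoder such as DINO-v2 or CLIP). The sketch below assumes the embeddings are already computed and uses toy vectors. The function names `consistency_to_first` and `mean_adjacent_consistency` are hypothetical and are not taken from the paper or its code.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def consistency_to_first(segments: List[Sequence[float]]) -> List[float]:
    """Similarity of every later segment to the first one (a drift curve)."""
    if len(segments) < 2:
        raise ValueError("need at least two segments")
    reference = segments[0]
    return [cosine(reference, emb) for emb in segments[1:]]


def mean_adjacent_consistency(segments: List[Sequence[float]]) -> float:
    """Average similarity between each pair of neighbouring segments."""
    if len(segments) < 2:
        raise ValueError("need at least two segments")
    sims = [cosine(a, b) for a, b in zip(segments, segments[1:])]
    return sum(sims) / len(sims)


# Toy embeddings (assumption: stand-ins for real encoder features).
# The third segment is orthogonal to the first two, mimicking identity drift.
toy_segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
drift = consistency_to_first(toy_segments)
adjacent = mean_adjacent_consistency(toy_segments)
```

A falling drift curve would suggest that appearance or identity departs from the opening segment as the clip gets longer, which is the kind of long-horizon degradation the benchmark is meant to diagnose. The actual LongAV-Compass metrics may be defined differently.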
Similar Articles
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
MSAVBench is the first comprehensive benchmark and adaptive evaluation framework for multi-shot audio-video generation, assessing 19 models across diverse tasks and achieving high alignment with human judgment.
WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
WebCompass is a multimodal benchmark for evaluating LLMs on web coding tasks across three input modalities (text, image, video) and three task types (generation, editing, repair). It introduces an Agent-as-a-Judge paradigm that autonomously executes generated websites in a real browser to assess visual fidelity and interactivity.
Benchmarking Speech-to-Speech Translation Models
COMPASS is a unified benchmarking framework for speech-to-speech translation (S2ST) that integrates 46 metrics across eight dimensions, evaluated on 1,248 model-language configurations. It identifies complementary architecture strengths and proposes reduced metric subsets that preserve rankings while cutting evaluation time.
When Vision Speaks for Sound
This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.
Native Audio-Visual Alignment for Generation
NAVA proposes a native audio-visual alignment framework for joint audio-video generation using an Align-then-Fuse MMDiT architecture, achieving improved synchronization and controllability with 6.3B parameters.