LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

May 25, 2026
Authors: Tengfei Liu, Yang Shi, Xuanyu Zhu, Jiafu Tang, Liu Yang, Qixun Wang, Zhuoran Zhang, Yuqi Tang, Fengxiang Wang, Yuhao Dong, Xinlong Chen, Bozhou Li, Bohan Zeng, Yue Ding, Xiaohan Zhang, Jialu Chen, Haotian Wang, Yuanxing Zhang, Pengfei Wan, Leye Wang
cs.AI

Abstract

Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5 to 10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy-guided construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.
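
As a rough illustration of what an embedding-based cross-segment consistency score could look like, the sketch below compares each segment's embedding with the first segment's by cosine similarity. The abstract does not specify how LongAV-Compass aggregates its metrics, so the reference-to-first-segment scheme, the function names `cosine`, `consistency_curve`, and `mean_consistency`, and the toy vectors are all assumptions made for illustration; in practice the embeddings would come from an encoder such as DINO-v2 or ArcFace.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_curve(segment_embeddings):
    """Similarity of every later segment to the first one (hypothetical scheme).

    A falling curve suggests that identity or appearance drifts as the
    generated video gets longer.
    """
    reference = segment_embeddings[0]
    return [cosine(reference, emb) for emb in segment_embeddings[1:]]


def mean_consistency(segment_embeddings):
    """Average of the consistency curve; a single segment counts as fully consistent."""
    curve = consistency_curve(segment_embeddings)
    return sum(curve) / len(curve) if curve else 1.0


# Toy 2-D embeddings (assumed values): the third segment has drifted completely.
toy_segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(consistency_curve(toy_segments))  # [1.0, 0.0]
print(mean_consistency(toy_segments))   # 0.5
```

Reporting the per-segment curve alongside the mean is one way to see where degradation sets in rather than only how much there is overall.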