LongAV-Compass: Unified Evaluation of Minute-Long Audio-Visual Generation Across T2AV, I2AV, and V2AV

Abstract

Audio-visual generation is rapidly advancing from short clips to minute-long content, yet existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on text-conditioned generation of 5 to 10 seconds and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark pairs taxonomy-guided test-case construction with a unified evaluation framework that integrates assessment assisted by multimodal large language models (MLLMs) with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models, together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-long audio-visual generation across diverse input modalities.
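To make the notion of cross-segment consistency concrete, the following is a minimal sketch of one way such a score could be computed from per-segment feature vectors. It is an illustration under stated assumptions, not the benchmark's actual implementation: the embeddings are assumed to come from some feature extractor (for example, a DINO-v2 or ArcFace model applied to each segment), and the function names `cosine`, `drift_from_first_segment`, and `mean_adjacent_consistency` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def drift_from_first_segment(embeddings: List[Sequence[float]]) -> List[float]:
    """Similarity of every later segment to the first one.

    A sequence that decreases over time suggests that identity or
    appearance drifts as the generated video gets longer.
    """
    reference = embeddings[0]
    return [cosine(reference, e) for e in embeddings[1:]]


def mean_adjacent_consistency(embeddings: List[Sequence[float]]) -> float:
    """Average similarity between consecutive segments (assumes at least two segments)."""
    pairs = zip(embeddings[:-1], embeddings[1:])
    sims = [cosine(a, b) for a, b in pairs]
    return sum(sims) / len(sims)


# Toy per-segment embeddings (hypothetical values, not real model outputs).
segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
drift = drift_from_first_segment(segments)
adjacent = mean_adjacent_consistency(segments)
```

Comparing each segment with the first one exposes gradual drift over a long horizon, whereas the adjacent-pair average only captures abrupt changes between neighboring segments; a diagnostic evaluation would plausibly report both views.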