
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

December 24, 2025
Authors: Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu
cs.AI

Abstract

Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following. These results indicate substantial room for improvement in future models and highlight the value of T2AV-Compass as a challenging, diagnostic testbed for advancing text-to-audio-video generation.
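The dual-level framework described above can be pictured as combining two families of per-system scores. The sketch below is illustrative only: the metric names, weights, and the simple weighted-mean aggregation are assumptions for clarity, not the paper's actual protocol.

```python
# Hypothetical sketch of dual-level score aggregation for one T2AV system.
# Metric names, example values, and equal weighting are illustrative
# assumptions, not T2AV-Compass's actual scoring rule.

def aggregate_scores(objective, subjective, w_obj=0.5, w_subj=0.5):
    """Combine objective signal-level metrics with subjective
    MLLM-as-a-Judge scores into one overall score.

    objective, subjective: dicts mapping metric name -> score in [0, 1].
    """
    obj_mean = sum(objective.values()) / len(objective)
    subj_mean = sum(subjective.values()) / len(subjective)
    return w_obj * obj_mean + w_subj * subj_mean

# Made-up per-dimension scores for a single system under evaluation.
objective = {"video_quality": 0.72, "audio_quality": 0.61, "av_alignment": 0.55}
subjective = {"instruction_following": 0.48, "realism": 0.50}
overall = aggregate_scores(objective, subjective)
```

Keeping the two levels separate until the final aggregation preserves the diagnostic value the abstract emphasizes: a system can be inspected per dimension (e.g. strong video quality but weak instruction following) rather than judged only by a single number.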