MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Abstract

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. Evaluating such frontier models, however, remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity and rely on rigid evaluation pipelines, which prevents systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework designed specifically for multi-shot audio-video generation. The benchmark spans four key dimensions (video, audio, shot, and reference) and covers diverse task settings, shot counts of up to 15, and challenging non-realistic scenarios. The evaluation framework improves robustness in three ways: an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. MSAVBench also aligns closely with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
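For readers unfamiliar with the agreement metric, the following is a minimal sketch of how a Spearman rank correlation between automatic scores and human ratings can be computed. It is not the MSAVBench evaluation code: the function names `average_ranks` and `spearman` and the sample numbers are hypothetical and chosen only for illustration. The coefficient lies in [-1, 1], so a reported value of 91.5% corresponds to 0.915.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores, for illustration only (not results from the paper).
benchmark_scores = [0.62, 0.71, 0.55, 0.80, 0.47]
human_ratings = [3.1, 3.0, 3.4, 4.2, 2.5]
rho = spearman(benchmark_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

Because the metric compares rankings rather than raw values, it rewards a benchmark for ordering models the way human raters do, regardless of the scale on which either set of scores is expressed.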