MSAVBench: Comprehensive and Reliable Evaluation for Multi-Shot Audio-Video Generation

Abstract

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in coverage and in the types of data they include, and they rely on rigid evaluation pipelines, which prevents systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions (video, audio, shot, and reference) and covers diverse task settings, variable shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through three mechanisms: adaptive self-correction for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. MSAVBench also aligns closely with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
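As an illustration of the human-alignment figure, the following is a minimal sketch of how a Spearman rank correlation between automatic scores and human ratings can be computed. It is not the released evaluation code: the function names `average_ranks` and `spearman` and the five score pairs are hypothetical and chosen only to show the calculation.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical data: one automatic score and one human rating per model.
benchmark_scores = [0.82, 0.64, 0.71, 0.55, 0.90]
human_ratings = [4.1, 3.5, 3.2, 2.8, 4.6]

rho = spearman(benchmark_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

Because the coefficient depends only on rank order, it measures whether the benchmark orders models the same way human raters do, regardless of the scale of either score.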