MSAVBench: Comprehensive and Reliable Evaluation for Multi-Shot Audio-Video Generation

Abstract

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating these frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity and rely on rigid evaluation pipelines, which prevents systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions (video, audio, shot, and reference) and covers diverse task settings, shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. MSAVBench also aligns closely with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
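As an illustration of the human-alignment measure mentioned above, the following is a minimal sketch of how a Spearman rank correlation between automatic scores and human ratings can be computed. It is not the released evaluation code: the function names `average_ranks` and `spearman` and the sample scores are hypothetical and chosen only for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one automatic score and one human rating per generated video.
metric_scores = [0.9, 0.7, 0.8, 0.4, 0.5]
human_ratings = [5, 3, 4, 2, 1]
rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")  # two swapped ranks out of five give 0.900
```

In this convention, the reported 91.5% would correspond to a coefficient of 0.915 on a scale from -1 to 1.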