ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

June 26, 2025
Authors: Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
cs.AI

Abstract

Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.
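The abstract trains ShotVL with Group Relative Policy Optimization (GRPO), whose core idea is to score each sampled answer relative to the other answers in its sampling group rather than against a learned value function. A minimal sketch of that group-relative advantage computation, with hypothetical 0/1 correctness rewards (the function name and reward values are illustrative, not from the paper):

```python
# Minimal sketch of the group-relative advantage at the heart of GRPO:
# each sampled answer's reward is normalized by its group's mean and
# standard deviation, so no separate value model is needed.
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize each reward against the group mean and std (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All answers scored identically: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# E.g., four sampled answers to one cinematography question,
# scored 1.0 (correct) or 0.0 (incorrect) by the reward function:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In the full algorithm these advantages weight the policy-gradient update for each answer's tokens; this sketch covers only the advantage step.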