ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

June 26, 2025
Authors: Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
cs.AI

Abstract

Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.
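
Note on the training method: the abstract names Group Relative Policy Optimization (GRPO) but does not detail the variant used for ShotVL. As background, the standard GRPO formulation (introduced in DeepSeekMath; whether ShotVL modifies it is not stated here) samples a group of $G$ responses $\{o_1, \dots, o_G\}$ per question $q$ from the old policy $\pi_{\theta_{\text{old}}}$, scores each with a reward $r_i$, and uses the group-normalized reward as the advantage, avoiding a learned value function:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},$$

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( \rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \Big) \right] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big],$$

where $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$ is the importance ratio, $\epsilon$ the clipping range, and $\beta$ the weight of the KL penalty against a frozen reference policy $\pi_{\mathrm{ref}}$.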