ViDiC: Video Difference Captioning

Abstract

Understanding visual differences between dynamic scenes requires comparative perception of compositional, spatial, and temporal changes, a capability that remains underexplored in existing vision-language systems. Prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, but these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the Video Difference Captioning (ViDiC) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework, based on the LLM-as-a-Judge protocol, that measures the accuracy of similarity descriptions and difference descriptions separately. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can serve as a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.
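The dual-checklist idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `ChecklistItem` fields, the yes/no question format, and the `judge` callable (a stand-in for an LLM-as-a-Judge call) are all assumptions made for illustration. The only property taken from the abstract is that similarity items and difference items are scored separately.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ChecklistItem:
    # Hypothetical schema: a yes/no question about a video pair.
    question: str
    expected: bool  # annotated ground-truth answer
    kind: str       # "similarity" or "difference"


def score_caption(
    caption: str,
    items: List[ChecklistItem],
    judge: Callable[[str, str], bool],
) -> Dict[str, Optional[float]]:
    """Return accuracy per checklist kind, scored independently.

    `judge(caption, question)` is an assumed interface: it should return
    the answer to `question` that the caption implies (in practice this
    would be an LLM call; here it is any callable).
    """
    correct = {"similarity": 0, "difference": 0}
    total = {"similarity": 0, "difference": 0}
    for item in items:
        if item.kind not in total:
            raise ValueError(f"unknown checklist kind: {item.kind!r}")
        total[item.kind] += 1
        if judge(caption, item.question) == item.expected:
            correct[item.kind] += 1
    # None marks a kind with no items, so it is not confused with 0% accuracy.
    return {
        kind: (correct[kind] / total[kind] if total[kind] else None)
        for kind in total
    }
```

Keeping the two accuracies apart means a model that reports every aspect as "changed" cannot score well on both checklists at once, which a single pooled accuracy could hide.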
