

ViDiC: Video Difference Captioning

December 3, 2025
Authors: Jiangtao Wu, Shihao Li, Zhaozhou Bian, Yuanxing Zhang, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Jiaheng Liu
cs.AI

Abstract

Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes, a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures similarity and difference recognition accuracy separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can serve as a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.
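The dual-checklist protocol described in the abstract can be pictured as scoring each model caption against a video pair's similarity items and difference items, then averaging the judge's verdicts per category. The sketch below is a minimal, hypothetical illustration of that idea; the function and field names (`ChecklistItem`, `score_caption`, `judge`) are assumptions for this example, and the paper's actual prompts, judge model, and aggregation may differ.

```python
# Minimal sketch of dual-checklist scoring (hypothetical names; not the paper's code).
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ChecklistItem:
    text: str   # e.g. "Both videos show a red car."
    kind: str   # "similarity" or "difference"


def score_caption(
    caption: str,
    checklist: List[ChecklistItem],
    judge: Callable[[str, str], bool],
) -> Dict[str, Optional[float]]:
    """Score one comparative caption against a video pair's checklist.

    `judge(caption, item_text)` stands in for an LLM-as-a-Judge call that
    returns True if the caption correctly reflects the checklist item.
    Similarity and difference accuracies are reported separately.
    """
    hits = {"similarity": 0, "difference": 0}
    totals = {"similarity": 0, "difference": 0}
    for item in checklist:
        totals[item.kind] += 1
        if judge(caption, item.text):
            hits[item.kind] += 1
    return {
        kind: (hits[kind] / totals[kind]) if totals[kind] else None
        for kind in ("similarity", "difference")
    }


if __name__ == "__main__":
    # Toy example with a naive keyword judge; a real setup would prompt an LLM.
    checklist = [
        ChecklistItem("Both videos feature a red car.", "similarity"),
        ChecklistItem("The second video is played in reverse.", "difference"),
    ]
    caption = "Both clips show a red car, but the second clip runs backwards."
    naive_judge = lambda cap, item: "red car" in cap.lower() or "reverse" in cap.lower()
    print(score_caption(caption, checklist, naive_judge))
```

In this framing, a model cannot inflate its score by listing only differences: the similarity checklist is scored independently, so hallucinated changes and missed shared content both show up in the per-category accuracies.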