Towards Understanding Camera Motions in Any Video
April 21, 2025
Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
cs.AI
Abstract
We introduce CameraBench, a large-scale dataset and benchmark designed to
assess and improve camera motion understanding. CameraBench consists of ~3,000
diverse internet videos, annotated by experts through a rigorous multi-stage
quality control process. One of our contributions is a taxonomy of camera
motion primitives, designed in collaboration with cinematographers. We find,
for example, that some motions like "follow" (or tracking) require
understanding scene content like moving subjects. We conduct a large-scale
human study to quantify human annotation performance, revealing that domain
expertise and tutorial-based training can significantly enhance accuracy. For
example, a novice may confuse zoom-in (a change of intrinsics) with translating
forward (a change of extrinsics), but can be trained to differentiate the two.
Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language
Models (VLMs), finding that SfM models struggle to capture semantic primitives
that depend on scene content, while VLMs struggle to capture geometric
primitives that require precise estimation of trajectories. We then fine-tune a
generative VLM on CameraBench to achieve the best of both worlds and showcase
its applications, including motion-augmented captioning, video question
answering, and video-text retrieval. We hope our taxonomy, benchmark, and
tutorials will drive future efforts towards the ultimate goal of understanding
camera motions in any video.
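The zoom-in vs. translate-forward confusion mentioned in the abstract has a simple geometric explanation: zooming changes the focal length (an intrinsic parameter) and rescales the whole image uniformly, while moving the camera forward changes its pose (an extrinsic parameter) and produces depth-dependent parallax. The sketch below illustrates this with a 1-D pinhole projection; it is a hypothetical example for intuition, not code from the paper.

```python
# Hypothetical sketch: why a zoom-in (intrinsics change) and a forward
# translation (extrinsics change) look similar but are geometrically distinct.
# 1-D pinhole projection: u = f * X / Z.

def project(f, X, Z):
    """Project a point at lateral offset X and depth Z with focal length f."""
    return f * X / Z

near = (1.0, 2.0)    # (X, Z): a nearby point
far = (1.0, 10.0)    # a distant point at the same lateral offset
f = 1.0

# Zoom-in: double the focal length (intrinsics change).
# Every projection scales by the same factor -> no parallax.
zoom_near = project(2 * f, *near)
zoom_far = project(2 * f, *far)

# Dolly forward: move the camera 1 unit forward (extrinsics change),
# reducing each point's depth by 1. Near points magnify much faster
# than far points -> depth-dependent parallax.
dolly_near = project(f, near[0], near[1] - 1.0)
dolly_far = project(f, far[0], far[1] - 1.0)

print(zoom_near / project(f, *near))    # 2.0   (uniform scaling)
print(zoom_far / project(f, *far))      # 2.0
print(dolly_near / project(f, *near))   # 2.0   (near point doubles)
print(dolly_far / project(f, *far))     # ~1.11 (far point barely grows)
```

Only the dolly produces different magnification at different depths, which is the cue annotators can be trained to look for.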