Towards Understanding Camera Motions in Any Video
April 21, 2025
Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
cs.AI
Abstract
We introduce CameraBench, a large-scale dataset and benchmark designed to
assess and improve camera motion understanding. CameraBench consists of ~3,000
diverse internet videos, annotated by experts through a rigorous multi-stage
quality control process. One of our contributions is a taxonomy of camera
motion primitives, designed in collaboration with cinematographers. We find,
for example, that some motions like "follow" (or tracking) require
understanding scene content like moving subjects. We conduct a large-scale
human study to quantify human annotation performance, revealing that domain
expertise and tutorial-based training can significantly enhance accuracy. For
example, a novice may confuse zoom-in (a change of intrinsics) with translating
forward (a change of extrinsics), but can be trained to differentiate the two.
Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language
Models (VLMs), finding that SfM models struggle to capture semantic primitives
that depend on scene content, while VLMs struggle to capture geometric
primitives that require precise estimation of trajectories. We then fine-tune a
generative VLM on CameraBench to achieve the best of both worlds and showcase
its applications, including motion-augmented captioning, video question
answering, and video-text retrieval. We hope our taxonomy, benchmark, and
tutorials will drive future efforts towards the ultimate goal of understanding
camera motions in any video.
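The zoom-in vs. translate-forward confusion mentioned in the abstract has a simple geometric explanation: zooming changes the focal length (an intrinsic parameter) and rescales the whole image uniformly, while moving the camera forward changes its pose (an extrinsic parameter) and produces depth-dependent parallax. The sketch below illustrates this with a 1-D pinhole projection; it is a hypothetical example for intuition, not code from the paper.

```python
# Hypothetical sketch: why a zoom-in (intrinsics change) and a forward
# translation (extrinsics change) look similar but are geometrically distinct.
# 1-D pinhole projection: u = f * X / Z.

def project(f, X, Z):
    """Project a point at lateral offset X and depth Z with focal length f."""
    return f * X / Z

near = (1.0, 2.0)    # (X, Z): a nearby point
far = (1.0, 10.0)    # a distant point at the same lateral offset
f = 1.0

# Zoom-in: double the focal length (intrinsics change).
# Every projection scales by the same factor -> no parallax.
zoom_near = project(2 * f, *near)
zoom_far = project(2 * f, *far)

# Dolly forward: move the camera 1 unit forward (extrinsics change),
# reducing each point's depth by 1. Near points magnify much faster
# than far points -> depth-dependent parallax.
dolly_near = project(f, near[0], near[1] - 1.0)
dolly_far = project(f, far[0], far[1] - 1.0)

print(zoom_near / project(f, *near))    # 2.0   (uniform scaling)
print(zoom_far / project(f, *far))      # 2.0
print(dolly_near / project(f, *near))   # 2.0   (near point doubles)
print(dolly_far / project(f, *far))     # ~1.11 (far point barely grows)
```

Only the dolly produces different magnification at different depths, which is the cue annotators can be trained to look for.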