Towards Understanding Camera Motions in Any Video

Abstract

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions such as "follow" (or tracking) require understanding scene content such as moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) models and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
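The zoom-in versus forward-translation confusion mentioned above can be illustrated with a minimal sketch. It assumes an idealized pinhole camera looking along the z-axis; the function names, point coordinates, and parameter values are hypothetical and chosen only for illustration, and none of this is taken from CameraBench or the authors' method. The sketch compares how much a near point and a far point are magnified when the focal length changes (intrinsics) versus when the camera moves forward (extrinsics).

```python
def project(point, focal_length, camera_z=0.0):
    """Pinhole projection of a 3D point (x, y, z) onto the image plane.

    focal_length is an intrinsic parameter; camera_z is the camera's
    position along the optical axis (an extrinsic parameter).
    """
    x, y, z = point
    depth = z - camera_z
    if depth <= 0:
        raise ValueError("point is behind the camera")
    return (focal_length * x / depth, focal_length * y / depth)


def magnification(point, new_focal_length=1.0, new_camera_z=0.0):
    """Ratio of the projected x-coordinate after vs. before a camera change.

    The reference camera has focal length 1.0 and sits at z = 0.
    """
    before = project(point, 1.0, 0.0)
    after = project(point, new_focal_length, new_camera_z)
    return after[0] / before[0]


# Hypothetical scene: one point close to the camera, one far away.
near = (1.0, 1.0, 2.0)
far = (1.0, 1.0, 10.0)

# Zoom-in: double the focal length, camera stays in place.
zoom_near = magnification(near, new_focal_length=2.0)
zoom_far = magnification(far, new_focal_length=2.0)

# Forward translation: move the camera 1 unit forward, focal length fixed.
dolly_near = magnification(near, new_camera_z=1.0)
dolly_far = magnification(far, new_camera_z=1.0)

print("zoom-in magnification  (near, far):", zoom_near, zoom_far)
print("forward magnification  (near, far):", dolly_near, dolly_far)
```

Under these assumptions, zooming magnifies both points by the same factor regardless of depth, whereas moving forward magnifies the near point more than the far one. That depth-dependent parallax is the cue that separates a change of extrinsics from a change of intrinsics.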
