Generative Video Motion Editing with 3D Point Tracks

Abstract

Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
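To make the depth-cue argument concrete, the following is a minimal sketch of why a 3D track carries information that a 2D track does not. It is an illustration only and not the model or pipeline described above. The pinhole projection, the function names `project` and `visible_tracks`, and the `focal` and `radius` parameters are all assumptions introduced for this example.

```python
from typing import List, Tuple

# Hypothetical representation: one track sample as (x, y, z) in camera
# coordinates, with z > 0 meaning the point lies in front of the camera.
Point3D = Tuple[float, float, float]


def project(point: Point3D, focal: float = 1.0) -> Tuple[float, float, float]:
    """Pinhole projection (an assumed camera model). Returns (u, v, depth)."""
    x, y, z = point
    return (focal * x / z, focal * y / z, z)


def visible_tracks(points: List[Point3D], focal: float = 1.0,
                   radius: float = 0.05) -> List[bool]:
    """Mark a point as hidden when a closer point projects within `radius` of it.

    This is a toy depth test over sparse points, not a full renderer.
    """
    projected = [project(p, focal) for p in points]
    visible = []
    for i, (u, v, z) in enumerate(projected):
        occluded = any(
            j != i
            and zj < z
            and (u - uj) ** 2 + (v - vj) ** 2 <= radius ** 2
            for j, (uj, vj, zj) in enumerate(projected)
        )
        visible.append(not occluded)
    return visible


# Two points that land on the same pixel: only depth tells them apart.
source = [(0.0, 0.0, 2.0), (0.0, 0.0, 4.0)]
# An edited target motion that moves the second point in front of the first.
target = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0)]

print(visible_tracks(source))
print(visible_tracks(target))
```

In both frames the two points project to the same image location, so their 2D tracks are identical before and after the edit. Only the depth component shows that the occlusion order has flipped, which is the kind of ambiguity the abstract says 3D tracks resolve.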
