Go-with-the-Track: Video Compositing and Motion Control with Point Tracking

Abstract

Filmmaking demands precise motion control and reference image compositing, capabilities that existing methods treat separately. Point-track-conditioned image-to-video models restrict content insertion to the first frame, while reference-to-video models lack fine-grained spatio-temporal control over how reference content is integrated across frames. We present Go-with-the-Track, which unifies both capabilities by jointly conditioning on multiple reference images and reference-anchored point-tracks. These extend conventional point-tracks to explicitly establish correspondences between generated frames and reference images, enabling precise compositing and motion control throughout the video. To achieve this, we introduce spatially-aware point-track embeddings that encode the full sequence of point-track coordinates using a coordinate-wise MLP followed by temporal pooling. This representation captures the spatial characteristics of each point-track, so it serves as a unique identifier, and embedding similarity correlates directly with spatial proximity, which improves the model's ability to distinguish and associate point-tracks. We inject these point-track embeddings into a video diffusion transformer via a lightweight adapter, resolving the pixel-to-patch resolution mismatch while avoiding the substantial loss of motion detail inherent in naive point-track subsampling. To improve motion controllability, we adopt a hybrid training strategy that trains jointly on video datasets of dynamic, static, and synthetic scenes. Experiments demonstrate that Go-with-the-Track achieves superior motion and reference control in a single model and enables new capabilities: multi-reference conditioned video generation with point-track-driven compositing, as well as camera control for both static and dynamic scenes. Project Page: https://eyeline-labs.github.io/Go-with-the-Track/
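
To make the point-track embedding more concrete, the following is a minimal sketch of a coordinate-wise MLP followed by temporal pooling, written in NumPy. It is not the authors' implementation. The function names (`init_mlp`, `embed_tracks`), the layer sizes, the tanh activation, the choice of mean pooling, and the assumption that coordinates are normalized to [0, 1] are all illustrative assumptions; the weights are random rather than trained, and visibility handling is omitted.

```python
import numpy as np

def init_mlp(rng, in_dim=2, hidden=64, out_dim=32):
    """Random weights for a two-layer MLP (all sizes are assumed, not from the paper)."""
    return {
        "w1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def embed_tracks(tracks, params):
    """Embed point-tracks of shape (N, T, 2) into vectors of shape (N, D).

    The same MLP is applied to every (x, y) coordinate independently,
    then the per-frame features are averaged over the T frames.
    Mean pooling is an assumption; the abstract only says "temporal pooling".
    """
    hidden = np.tanh(tracks @ params["w1"] + params["b1"])  # (N, T, hidden)
    per_frame = hidden @ params["w2"] + params["b2"]        # (N, T, D)
    return per_frame.mean(axis=1)                           # (N, D)

rng = np.random.default_rng(0)
params = init_mlp(rng)

# Three hypothetical tracks over 8 frames with the same rightward motion:
# a base track, one starting 0.01 away, and one starting 0.5 away.
t = np.arange(8)
base = np.stack([0.2 + 0.01 * t, np.full(8, 0.3)], axis=-1)  # (8, 2)
tracks = np.stack([base, base + 0.01, base + 0.5])           # (3, 8, 2)

emb = embed_tracks(tracks, params)
near = np.linalg.norm(emb[0] - emb[1])
far = np.linalg.norm(emb[0] - emb[2])
print(emb.shape, near < far)
```

Because the MLP is a smooth function of the coordinates, tracks that start close together map to nearby embeddings even with untrained weights, which illustrates the proximity property described above. Each track keeps one embedding regardless of its length, so no frames need to be subsampled before encoding.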