MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

Abstract

Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that uses multi-view point tracking as an additional geometric and motion supervision signal for camera-conditioning-only novel-view video diffusion models. Our key finding is that specific attention layers encode strong correspondence cues: query features attend to key features at geometrically corresponding locations across views and over time, and misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen enables existing models to follow the motion in the reference view more faithfully and to maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.
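
To make the correspondence idea concrete, the following is a minimal sketch and not the method described above. The abstract does not specify the tracking head, the form of the point-tracking objective, or the loss weighting, so everything here is an assumption made for illustration: a soft-argmax readout over query-key attention weights stands in for the auxiliary tracking head, an L1 distance stands in for the point-tracking objective, and the names `attention_track`, `tracking_loss`, `joint_loss`, and the weight `0.1` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_track(query, keys, positions):
    """Soft-argmax readout (an assumed stand-in for a tracking head).

    query:     feature vector of one tracked point in the reference view
    keys:      feature vectors of candidate locations in another view or frame
    positions: (x, y) coordinates of those candidate locations
    Returns the attention-weighted expected position.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(logits)
    x = sum(w * p[0] for w, p in zip(weights, positions))
    y = sum(w * p[1] for w, p in zip(weights, positions))
    return (x, y)


def tracking_loss(predicted, target):
    """L1 distance between predicted and ground-truth track positions (assumed)."""
    return abs(predicted[0] - target[0]) + abs(predicted[1] - target[1])


def joint_loss(diffusion_loss, track_loss, weight=0.1):
    """Diffusion loss plus a weighted tracking term; the weight is hypothetical."""
    return diffusion_loss + weight * track_loss


# Toy 2x2 grid of candidate locations with one-hot-like key features.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
keys = [
    [10.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 0.0, 0.0],
    [0.0, 0.0, 10.0, 0.0],
    [0.0, 0.0, 0.0, 10.0],
]
query = [0.0, 10.0, 0.0, 0.0]  # matches the key located at (1, 0)

predicted = attention_track(query, keys, positions)
loss = joint_loss(diffusion_loss=0.5,
                  track_loss=tracking_loss(predicted, (1.0, 0.0)))
```

In this toy case the query matches a single key, so the attention is sharply peaked and the predicted position falls on that key's location. A query that matches no key would spread its attention evenly and produce a position near the grid center, which is the kind of misaligned correspondence a tracking loss would penalize.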