MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

Abstract

Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that leverages multi-view point tracking as an additional geometric and motion supervision signal for camera-conditioning-only novel-view video diffusion models. Our key finding is that specific attention layers encode strong correspondence cues, where query features attend to key features at geometrically corresponding locations across views and over time, and the misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen improves existing models to better follow the motion in the reference view and maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.
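As a rough illustration of the idea that attention can be read as correspondence and supervised with point tracks, the following minimal sketch computes a soft-argmax location from query-key attention and adds an L1 tracking term to a diffusion loss. It is not the paper's tracking head, which the abstract does not specify: the function names, the soft-argmax readout, the L1 loss, and the `track_weight` value are all assumptions made for illustration.

```python
import math

def soft_argmax_correspondence(query, keys, temperature=1.0):
    """Return the attention-weighted (x, y) location of `query` in a key grid.

    query: list of C floats (one query feature).
    keys:  H x W x C nested lists (key features from another view or frame).
    Hypothetical readout; the actual tracking head is not described in the abstract.
    """
    dim = len(query)
    # Scaled dot-product logits between the query and every key position.
    logits = [[sum(q * k for q, k in zip(query, key)) / (math.sqrt(dim) * temperature)
               for key in row] for row in keys]
    # Numerically stable softmax over all H*W positions.
    peak = max(max(row) for row in logits)
    weights = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in weights)
    # Expected coordinate under the attention distribution (soft-argmax).
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y

def tracking_loss(pred, target):
    """Mean L1 distance between predicted and ground-truth track points."""
    return sum(abs(px - tx) + abs(py - ty)
               for (px, py), (tx, ty) in zip(pred, target)) / len(pred)

def joint_loss(diffusion_loss, pred_tracks, gt_tracks, track_weight=0.1):
    """Diffusion loss plus a weighted point-tracking term (weight is an assumed value)."""
    return diffusion_loss + track_weight * tracking_loss(pred_tracks, gt_tracks)
```

In this sketch, a query whose attention mass lands on the wrong key location yields a predicted point far from the ground-truth track, so the tracking term penalizes exactly the misaligned correspondences that the abstract associates with motion inconsistency.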