Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks
June 14, 2026
Authors: Feng Qiao, Zhaochong An, Zhexiao Xiong, Serge Belongie, Nathan Jacobs
cs.AI
Abstract
Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from the source view to the target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. The project page is available at https://qjizhi.github.io/track2view
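To make the notion of paired 3D point tracks concrete, the following minimal sketch projects the same set of 3D scene points into a source camera and a target camera, giving two 2D tracks per point that correspond one-to-one across time. It is an illustration only and not the authors' implementation: the pinhole camera model, the `project_tracks` function, and the array layouts are assumptions, since the abstract does not specify them.

```python
import numpy as np

def project_tracks(points_world, K, world_to_cam):
    """Project 3D point tracks into one camera view (hypothetical helper).

    points_world: (T, N, 3) world-space positions of N points over T frames.
    K:            (3, 3) pinhole intrinsics (assumed shared by all frames).
    world_to_cam: (T, 4, 4) per-frame world-to-camera extrinsics.
    Returns pixel coordinates (T, N, 2) and a mask (T, N) that is True
    where the point lies in front of the camera.
    """
    T, N, _ = points_world.shape
    # Homogeneous coordinates so a 4x4 extrinsic matrix can be applied.
    homog = np.concatenate([points_world, np.ones((T, N, 1))], axis=-1)
    cam = np.einsum("tij,tnj->tni", world_to_cam, homog)[..., :3]
    depth = cam[..., 2]
    in_front = depth > 1e-6
    # Avoid division by zero for points behind the camera; they are masked out.
    safe_depth = np.where(in_front, depth, 1.0)
    pix = np.einsum("ij,tnj->tni", K, cam)
    uv = pix[..., :2] / safe_depth[..., None]
    return uv, in_front

# Toy example: one static point 2 units in front of the source camera.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
points = np.zeros((2, 1, 3))
points[..., 2] = 2.0

source_cams = np.tile(np.eye(4), (2, 1, 1))  # source camera stays fixed
target_cams = source_cams.copy()
target_cams[1, 0, 3] = -0.2                  # target camera moves +0.2 along x at frame 1

source_uv, source_mask = project_tracks(points, K, source_cams)
target_uv, target_mask = project_tracks(points, K, target_cams)

# Row i of source_uv[t] and target_uv[t] describe the same scene point,
# which is the one-to-one pairing the conditioning signal relies on.
print(source_uv[:, 0], target_uv[:, 0])
```

Because both tracks come from the same 3D points, the pairing is explicit and continuous over time by construction. In this sketch the mask only flags points behind the camera; a real pipeline would also need to handle occlusion and points that leave the image bounds.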