Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

June 14, 2026
Authors: Feng Qiao, Zhaochong An, Zhexiao Xiong, Serge Belongie, Nathan Jacobs
cs.AI

Abstract

Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene in every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicitly learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from the source view to the target view through parameter-free geometric operations and learned temporal aggregation, so that the model generalizes to arbitrary camera trajectories rather than memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results in visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. The project page is available at https://qjizhi.github.io/track2view
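
To make the idea of paired point tracks concrete, the following is a minimal sketch of how one 3D scene point, followed over time, could be projected into a source camera and a target camera to yield two 2D tracks that correspond one-to-one. It is an illustration only, not the authors' implementation: the pinhole camera model, the world-to-camera convention X_cam = R X + t, the shared intrinsics (f, cx, cy), and the names project and paired_tracks are all assumptions introduced here.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]
Camera = Tuple[Mat3, Vec3]          # hypothetical (R, t) world-to-camera pose
Pixel = Optional[Tuple[float, float]]


def project(point: Vec3, R: Mat3, t: Vec3,
            f: float, cx: float, cy: float) -> Pixel:
    """Pinhole projection of a world point; None if it lies behind the camera."""
    # World-to-camera transform (assumed convention): X_cam = R @ X + t
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i]
               for i in range(3))
    if z <= 0:
        return None
    return (f * x / z + cx, f * y / z + cy)


def paired_tracks(tracks_3d: List[List[Vec3]],
                  src_cams: List[Camera],
                  tgt_cams: List[Camera],
                  f: float, cx: float, cy: float
                  ) -> List[Tuple[List[Pixel], List[Pixel]]]:
    """Project each 3D track into both views.

    tracks_3d[n][k] is the 3D position of point n at frame k, and
    src_cams[k] / tgt_cams[k] are the camera poses at frame k.
    Entry n of the result holds the source and target 2D tracks of point n,
    so the two tracks are in one-to-one correspondence at every frame.
    """
    paired = []
    for track in tracks_3d:
        src = [project(p, *src_cams[k], f, cx, cy) for k, p in enumerate(track)]
        tgt = [project(p, *tgt_cams[k], f, cx, cy) for k, p in enumerate(track)]
        paired.append((src, tgt))
    return paired


# Toy example: one point moving along +x over two frames.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
track = [[(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]]
source = [(I3, (0.0, 0.0, 0.0))] * 2      # static source camera at the origin
target = [(I3, (-1.0, 0.0, 0.0))] * 2     # target camera shifted by +1 along x
result = paired_tracks(track, source, target, f=100.0, cx=50.0, cy=50.0)
```

In this toy setup the same scene point lands at different pixels in the two views at each frame, and it is this per-frame pairing of pixel locations that stands in for the explicit source-to-target correspondence the abstract describes. How the actual conditioner consumes such pairs is not specified in the abstract and is not modeled here.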