Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

May 14, 2026
Authors: Yifan Wang, Tong He
cs.AI

Abstract

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.
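
The two interface steps named in the abstract, visible-token selection and target-frame positional alignment, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the warp has already produced a per-pixel validity mask and patch-level tokens, and that positions are simple `(frame, row, column)` indices. The function names, the `patch` size, and the `min_valid_frac` threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def select_visible_tokens(valid_mask, patch=2, min_valid_frac=0.5):
    """Mark which patch tokens of the warped history have a valid source.

    valid_mask: (T, H, W) bool array, True where the warp found a source pixel.
    Returns a (T, H // patch, W // patch) bool array of tokens to keep.
    The patch size and threshold are illustrative assumptions.
    """
    T, H, W = valid_mask.shape
    assert H % patch == 0 and W % patch == 0, "H and W must be divisible by patch"
    # Group pixels into patch x patch blocks and measure the valid fraction.
    blocks = valid_mask.reshape(T, H // patch, patch, W // patch, patch)
    valid_frac = blocks.mean(axis=(2, 4))
    return valid_frac >= min_valid_frac


def build_pseudo_history(warped_tokens, keep, target_frame_ids):
    """Drop invisible tokens and tag the rest with target-frame positions.

    warped_tokens: (T, h, w, C) tokens of the camera-warped frames.
    keep: (T, h, w) bool array from select_visible_tokens.
    target_frame_ids: length-T temporal indices of the frames being denoised.
    Returns (tokens, positions) with shapes (N, C) and (N, 3), where each
    position is (target frame index, row, column).
    """
    t_idx, y_idx, x_idx = np.nonzero(keep)
    tokens = warped_tokens[t_idx, y_idx, x_idx]
    # Use the target frame's temporal index rather than a "past" index,
    # so each history token shares a position with the frame it was warped to.
    frame_pos = np.asarray(target_frame_ids)[t_idx]
    positions = np.stack([frame_pos, y_idx, x_idx], axis=1)
    return tokens, positions
```

In this sketch, the surviving tokens and their positions would be passed to whatever visual-history input the frozen model exposes. How that pathway consumes them is model-specific and is not specified in the abstract.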