SketchVideo: Sketch-based Video Generation and Editing
March 30, 2025
Authors: Feng-Lin Liu, Hongbo Fu, Xintao Wang, Weicai Ye, Pengfei Wan, Di Zhang, Lin Gao
cs.AI
Abstract
Video generation and editing conditioned on text prompts or images have undergone significant advancements. However, challenges remain in accurately controlling global layout and geometric details using text alone, and in supporting motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and to support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism to analyze the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial features and dynamic motion. During inference, we use latent fusion to accurately preserve unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing.
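Latent fusion for editing is described only at a high level; one common way to realize it (my assumed reading, in the spirit of blended-diffusion inpainting, not a confirmed detail of the paper) is to blend the denoised latent of the edited video with the original video's latent, noised to the same timestep, outside the edit mask:

```python
import torch

def latent_fusion_step(edited_latent: torch.Tensor,
                       original_latent_t: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
    """Assumed latent-fusion step: keep generated content inside the edit
    mask and the original video's noised latent outside it.

    edited_latent:     (B, C, T, H, W) latent after one denoising step
    original_latent_t: (B, C, T, H, W) original latent noised to the same timestep
    mask:              (B, 1, T, H, W) 1 inside the edited region, 0 outside
    """
    return mask * edited_latent + (1 - mask) * original_latent_t
```

Applied at every sampling step, this kind of blending keeps the unedited regions faithful to the source video after decoding, which matches the preservation behavior the abstract claims.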