MagicStick: Controllable Video Editing via Control Handle Transformations
December 5, 2023
Authors: Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen
cs.AI
Abstract
Text-based video editing has recently attracted considerable interest for changing the style of a video or replacing objects with ones of similar structure. Beyond this, we demonstrate that properties such as shape, size, location, and motion can also be edited in videos. Our key insight is that keyframe transformations of specific internal features (e.g., edge maps of objects or human pose) can easily be propagated to other frames to provide generation guidance. We thus propose MagicStick, a controllable video editing method that edits video properties by applying transformations to the extracted internal control signals. In detail, to preserve appearance, we inflate both the pretrained image diffusion model and ControlNet to the temporal dimension and train low-rank adaptation (LoRA) layers to fit the specific scene. Then, for editing, we adopt an inversion-and-editing framework. Unlike prior pipelines, the fine-tuned ControlNet is introduced in both inversion and generation to provide attention guidance, together with the proposed attention remixing between the spatial attention maps of inversion and editing. Despite its simplicity, our method is the first to demonstrate video property editing with a pretrained text-to-image model. We present experiments on numerous examples within our unified framework. We also compare against shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability over previous works. The code and models will be made publicly available.
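
To make the pipeline concrete, here is a minimal sketch of how a user edit on an extracted control signal (here, per-frame edge maps) might be applied uniformly across frames to guide generation. The function name transform_control_signal and the simple uniform-scaling transform are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def transform_control_signal(edges: torch.Tensor, scale: float = 1.3) -> torch.Tensor:
    """edges: (frames, 1, H, W) per-frame edge maps; returns rescaled maps.

    A sampling-grid zoom with factor 1/scale enlarges the depicted object by
    `scale` in every frame, standing in for a user-edited keyframe transform
    propagated across the clip.
    """
    f, c, h, w = edges.shape
    theta = torch.tensor([[1.0 / scale, 0.0, 0.0],
                          [0.0, 1.0 / scale, 0.0]], dtype=edges.dtype)
    grid = F.affine_grid(theta.unsqueeze(0).expand(f, -1, -1),
                         size=(f, c, h, w), align_corners=False)
    return F.grid_sample(edges, grid, align_corners=False)
```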
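The abstract describes inflating the pretrained image diffusion model and ControlNet to the temporal dimension and fitting LoRA layers to the scene. The sketch below shows the standard form of both ideas, assuming a fold-time-into-batch inflation; the names InflatedConv2d and LoRALinear and the rank/alpha defaults are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn

class InflatedConv2d(nn.Conv2d):
    """Applies pretrained 2D convolution weights frame-by-frame to a video."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)  # fold time into batch
        x = super().forward(x)                                # reuse 2D weights as-is
        _, c2, h2, w2 = x.shape
        return x.reshape(b, f, c2, h2, w2).permute(0, 2, 1, 3, 4)

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank residual update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base.requires_grad_(False)  # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)          # update starts at zero (identity)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Folding time into the batch axis lets every pretrained 2D weight be reused unchanged, which is why only the lightweight LoRA layers need training to fit a specific scene.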
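The attention remixing between the spatial attention maps of inversion and editing is described only at a high level. One plausible reading, sketched under that assumption, is a mask-guided blend that keeps the editing branch's attention inside the edited region and the inversion branch's attention elsewhere to preserve appearance; attention_remix and edit_mask are hypothetical names.

```python
import torch

def attention_remix(attn_inversion: torch.Tensor,
                    attn_editing: torch.Tensor,
                    edit_mask: torch.Tensor) -> torch.Tensor:
    """
    attn_inversion, attn_editing: (heads, query_tokens, key_tokens) maps.
    edit_mask: (query_tokens,) soft mask in [0, 1], 1 where the edit applies.
    Returns a blended map: editing attention inside the edited region,
    inversion attention outside it (preserving the original appearance).
    """
    m = edit_mask.view(1, -1, 1)  # broadcast over heads and key tokens
    return m * attn_editing + (1.0 - m) * attn_inversion
```

Because the blend is a per-row convex combination of two row-normalized maps, each output row still sums to one, so the remixed maps remain valid attention distributions.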