LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
June 11, 2025
Authors: Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, Tianfan Xue
cs.AI
Abstract
Diffusion-based video editing has achieved remarkable results in generating
high-quality edits. However, current methods often rely
on large-scale pretraining, limiting flexibility for specific edits.
First-frame-guided editing provides control over the first frame, but lacks
flexibility over subsequent frames. To address this, we propose a mask-based
LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video
(I2V) models for flexible video editing. Our approach preserves background
regions while enabling controllable propagation of edits. This solution offers
efficient and adaptable video editing without altering the model architecture.
To better steer this process, we incorporate additional references, such as
alternate viewpoints or representative scene states, which serve as visual
anchors for how content should unfold. We address the control challenge using a
mask-driven LoRA tuning strategy that adapts a pretrained I2V model
to the editing context. The model must learn from two distinct sources: the
input video provides spatial structure and motion cues, while reference images
offer appearance guidance. A spatial mask enables region-specific learning by
dynamically modulating what the model attends to, ensuring that each area draws
from the appropriate source. Experimental results show that our method achieves
superior video editing performance compared to state-of-the-art methods.
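
The abstract names two ingredients, low-rank adaptation of a frozen model and a spatial mask that restricts where learning happens, without giving the training objective. The NumPy sketch below illustrates both ideas in their simplest generic form. The function names `lora_linear` and `masked_mse`, the `alpha / r` scaling, the tensor shapes, and the use of a mask-weighted squared error are all assumptions made for illustration; they are not taken from the paper and should not be read as the authors' implementation.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Apply a frozen weight W plus a low-rank update (alpha / r) * B @ A.

    Only A and B would be trained; W stays fixed (generic LoRA form).
    """
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

def masked_mse(pred, target, mask):
    """Squared error averaged only over positions where mask == 1.

    Hypothetical region-specific loss: positions with mask == 0 are ignored.
    """
    mask = np.broadcast_to(mask, pred.shape).astype(pred.dtype)
    denom = max(mask.sum(), 1.0)  # avoid division by zero for an empty mask
    return float((mask * (pred - target) ** 2).sum() / denom)

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init: update starts at 0

x = rng.normal(size=(4, d_in))
y0 = lora_linear(x, W, A, B)             # equals x @ W.T while B is zero

# Toy "video" of 2 frames, 4x4 pixels; the mask selects the left half.
pred = np.zeros((2, 4, 4))
mask = np.zeros((2, 4, 4))
mask[:, :, :2] = 1.0
target = np.zeros((2, 4, 4))
target[:, :, 2:] = 5.0                   # differs from pred only outside the mask
loss_inside = masked_mse(pred, target, mask)
loss_full = masked_mse(pred, np.ones((2, 4, 4)), mask)
```

In this toy setup the masked loss ignores every mismatch outside the selected region, which is the sense in which a mask can route different regions to different supervision sources (for example, the input video for one region and a reference image for another).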