LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
June 11, 2025
Authors: Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, Tianfan Xue
cs.AI
Abstract
Video editing using diffusion models has achieved remarkable results in
generating high-quality edits for videos. However, current methods often rely
on large-scale pretraining, limiting flexibility for specific edits.
First-frame-guided editing provides control over the first frame, but lacks
flexibility over subsequent frames. To address this, we propose a mask-based
LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video
(I2V) models for flexible video editing. Our approach preserves background
regions while enabling controllable edit propagation. This solution offers
efficient and adaptable video editing without altering the model architecture.
To better steer this process, we incorporate additional references, such as
alternate viewpoints or representative scene states, which serve as visual
anchors for how content should unfold. We address the control challenge using a
mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model
to the editing context. The model must learn from two distinct sources: the
input video provides spatial structure and motion cues, while reference images
offer appearance guidance. A spatial mask enables region-specific learning by
dynamically modulating what the model attends to, ensuring that each area draws
from the appropriate source. Experimental results show our method achieves
superior video editing performance compared to state-of-the-art methods.
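The core mechanism described above, a spatial mask that decides where a low-rank update applies so that edited regions learn from the reference while the background keeps the frozen pretrained behavior, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the class name, the gating point, and the mask shape are assumptions.

```python
import torch
import torch.nn as nn

class MaskAwareLoRALinear(nn.Module):
    """Linear layer with a LoRA update gated by a per-token spatial mask.

    Illustrative sketch only: in the paper's setting the base layer would be
    a projection inside a pretrained I2V diffusion model, and the mask marks
    the editable region (1) versus the preserved background (0).
    """

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: starts identical to base
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x:    (batch, tokens, in_features)
        # mask: (batch, tokens, 1); 1 inside the edit region, 0 in background
        delta = self.up(self.down(x)) * self.scale
        # Background tokens bypass the LoRA update entirely, so the
        # pretrained model's output is preserved there by construction.
        return self.base(x) + mask * delta

# Usage: wrap a projection and mark the first half of the tokens as editable.
base = nn.Linear(64, 64)
layer = MaskAwareLoRALinear(base, rank=4)
x = torch.randn(2, 16, 64)
mask = torch.zeros(2, 16, 1)
mask[:, :8] = 1.0
out = layer(x, mask)
```

Because the up-projection is zero-initialized, the wrapped layer reproduces the frozen base exactly before fine-tuning; training then only changes behavior where the mask is nonzero, which mirrors the region-specific learning the abstract describes.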