MagicStick: Controllable Video Editing via Control Handle Transformations
December 5, 2023
Authors: Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen
cs.AI
Abstract
Text-based video editing has recently attracted considerable interest for changing the style of a video or replacing objects with others of a similar structure. Beyond this, we demonstrate that properties such as shape, size, location, motion, etc., can also be edited in videos. Our key insight is that keyframe transformations of a specific internal feature (e.g., edge maps of objects or human pose) can easily be propagated to other frames to provide generation guidance. We thus propose MagicStick, a controllable video editing method that edits video properties by applying transformations to the extracted internal control signals. In detail, to keep the appearance, we inflate both the pretrained image diffusion model and ControlNet to the temporal dimension and train low-rank adaptation (LoRA) layers to fit the specific scenes. Then, for editing, we adopt an inversion-and-editing framework. Different from previous works, the finetuned ControlNet is introduced in both inversion and generation for attention guidance, together with a proposed attention remix that blends the spatial attention maps of inversion and editing. Though succinct, our method is the first to demonstrate video property editing with a pre-trained text-to-image model. We present experiments on numerous examples within our unified framework. We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared to previous works. The code and models will be made publicly available.
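
To make the temporal inflation mentioned in the abstract concrete, below is a minimal PyTorch sketch of one common way to reuse pretrained 2D convolution weights on video: each frame is folded into the batch dimension so the image-domain weights apply unchanged, with temporal modeling handled by separately added layers. The class name and tensor layout are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class InflatedConv2d(nn.Conv2d):
    """2D convolution applied frame-by-frame to a 5D video tensor.

    Hypothetical sketch: pretrained image-diffusion / ControlNet weights
    stay in 2D; frames are folded into the batch so each frame is
    convolved independently.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = super().forward(x)           # ordinary 2D convolution per frame
        _, c_out, h_out, w_out = x.shape
        return x.reshape(b, f, c_out, h_out, w_out).permute(0, 2, 1, 3, 4)


# Usage: video latents of shape (1, 4, 8, 64, 64).
conv = InflatedConv2d(4, 320, kernel_size=3, padding=1)
out = conv(torch.randn(1, 4, 8, 64, 64))   # -> (1, 320, 8, 64, 64)
```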
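
The "attention remix" between the spatial attention maps of inversion and editing is only named in the abstract; a plausible reading is a masked blend that keeps the editing-branch attention inside the edited region and reuses the inversion attention elsewhere to preserve unedited content. The sketch below encodes that guess; the function name, mask convention, and blending rule are assumptions rather than the paper's definition.

```python
import torch


def attention_remix(attn_inv: torch.Tensor,
                    attn_edit: torch.Tensor,
                    edit_mask: torch.Tensor) -> torch.Tensor:
    """Blend spatial self-attention maps from the inversion and editing passes.

    attn_inv, attn_edit: (heads, query_tokens, key_tokens) attention probabilities
    edit_mask:           (query_tokens,) binary mask, 1 where the edit applies

    Assumption: inside the edited region the editing attention is kept so the
    transformed control signal can take effect; outside it the inversion
    attention is reused so the rest of the frame stays faithful to the source.
    """
    mask = edit_mask.view(1, -1, 1).to(attn_edit.dtype)
    return mask * attn_edit + (1.0 - mask) * attn_inv


# Usage with toy shapes: 8 heads over 16x16 spatial tokens.
heads, tokens = 8, 256
remixed = attention_remix(
    torch.softmax(torch.randn(heads, tokens, tokens), dim=-1),
    torch.softmax(torch.randn(heads, tokens, tokens), dim=-1),
    (torch.rand(tokens) > 0.5).float(),
)
```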