

MagicStick: Controllable Video Editing via Control Handle Transformations

December 5, 2023
Authors: Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen
cs.AI

Abstract

Text-based video editing has recently attracted considerable interest for changing the style of a video or replacing objects with others of similar structure. Beyond this, we demonstrate that properties such as shape, size, location, and motion can also be edited in videos. Our key insight is that a keyframe transformation of a specific internal feature (e.g., the edge map of an object or a human pose) can easily be propagated to other frames to provide generation guidance. We thus propose MagicStick, a controllable video editing method that edits video properties by applying transformations to the extracted internal control signals. In detail, to preserve appearance, we inflate both the pretrained image diffusion model and ControlNet to the temporal dimension and train low-rank adaptation (LoRA) layers to fit the specific scene. Then, for editing, we adopt an inversion-and-editing framework. Differently from prior work, the fine-tuned ControlNet is introduced in both inversion and generation to provide attention guidance, together with the proposed attention remix between the spatial attention maps of inversion and editing. Though succinct, our method is the first to demonstrate video property editing from a pre-trained text-to-image model. We present experiments on numerous examples within our unified framework. We also compare against shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability over previous works. The code and models will be made publicly available.
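The attention-remix step described above can be pictured as blending the spatial attention maps recorded during inversion with those produced during editing, gated by a mask derived from the transformed control signal (e.g., the moved or resized edge map). The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the names `remix_attention`, `inv_attn`, `edit_attn`, and `edit_mask` are ours, not taken from the paper's released code.

```python
import torch


def remix_attention(inv_attn: torch.Tensor,
                    edit_attn: torch.Tensor,
                    edit_mask: torch.Tensor) -> torch.Tensor:
    """Blend spatial attention maps from inversion and editing (illustrative sketch).

    inv_attn  : (B, heads, N, N) attention recorded during inversion.
    edit_attn : (B, heads, N, N) attention produced during editing.
    edit_mask : (N,) binary mask over flattened spatial positions,
                1 where the transformed control signal is active.

    Outside the edited region we keep the inversion attention to preserve
    the original appearance and background; inside it we use the editing
    attention so the object follows the new control handle.
    """
    mask = edit_mask.view(1, 1, -1, 1).to(edit_attn.dtype)
    return mask * edit_attn + (1.0 - mask) * inv_attn


if __name__ == "__main__":
    B, H, N = 1, 8, 16 * 16                     # batch, heads, flattened spatial size
    inv = torch.softmax(torch.randn(B, H, N, N), dim=-1)
    edt = torch.softmax(torch.randn(B, H, N, N), dim=-1)
    msk = (torch.rand(N) > 0.7).float()         # toy mask of the edited region
    out = remix_attention(inv, edt, msk)
    print(out.shape)                            # torch.Size([1, 8, 256, 256])
```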