Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion
August 1, 2024
Authors: Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber
cs.AI
Abstract
Recent years have seen a tremendous improvement in the quality of video
generation and editing approaches. While several techniques focus on editing
appearance, few address motion. Current approaches using text, trajectories, or
bounding boxes are limited to simple motions, so we specify motions with a
single motion reference video instead. We further propose to use a pre-trained
image-to-video model rather than a text-to-video model. This approach allows us
to preserve the exact appearance and position of a target object or scene and
helps disentangle appearance from motion. Our method, called motion-textual
inversion, leverages our observation that image-to-video models extract
appearance mainly from the (latent) image input, while the text/image embedding
injected via cross-attention predominantly controls motion. We thus represent
motion using text/image embedding tokens. By operating on an inflated
motion-text embedding containing multiple text/image embedding tokens per
frame, we achieve a high temporal motion granularity. Once optimized on the
motion reference video, this embedding can be applied to various target images
to generate videos with semantically similar motions. Our approach does not
require spatial alignment between the motion reference video and target image,
generalizes across various domains, and can be applied to various tasks such as
full-body and face reenactment, as well as controlling the motion of inanimate
objects and the camera. We empirically demonstrate the effectiveness of our
method in the semantic video motion transfer task, significantly outperforming
existing methods in this context.
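The abstract outlines the core mechanism: freeze a pre-trained image-to-video diffusion model, take appearance from the (latent) first-frame input, and optimize only an inflated motion-text embedding with multiple tokens per frame via the standard denoising objective on the reference video. The sketch below illustrates that optimization loop under stated assumptions; ToyDenoiser, the tensor shapes, and all hyperparameters are illustrative stand-ins, not the paper's actual architecture or interface.

```python
"""Minimal, self-contained sketch of the motion-textual inversion idea.

A tiny frozen 'denoiser' stands in for a pre-trained image-to-video
diffusion model; we optimize only an inflated motion-text embedding
(several tokens per frame) so the model reconstructs a motion reference
video from its first-frame latent. All names/shapes are hypothetical.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

F_FRAMES, TOKENS_PER_FRAME, EMBED_DIM, LATENT_DIM = 8, 4, 64, 32


class ToyDenoiser(nn.Module):
    """Stand-in for a frozen image-to-video denoiser: consumes noisy video
    latents, a first-frame latent (appearance), and per-frame motion tokens
    injected in the spirit of cross-attention."""

    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(LATENT_DIM, num_heads=4, batch_first=True)
        self.to_kv = nn.Linear(EMBED_DIM, LATENT_DIM)
        self.out = nn.Linear(LATENT_DIM, LATENT_DIM)

    def forward(self, noisy, image_latent, motion_tokens):
        # noisy: (F, LATENT_DIM); image_latent: (1, LATENT_DIM)
        # motion_tokens: (F, TOKENS_PER_FRAME, EMBED_DIM)
        q = (noisy + image_latent).unsqueeze(1)   # (F, 1, LATENT_DIM)
        kv = self.to_kv(motion_tokens)            # (F, T, LATENT_DIM)
        attended, _ = self.attn(q, kv, kv)        # per-frame cross-attention
        return self.out(attended.squeeze(1))      # predicted noise, (F, LATENT_DIM)


torch.manual_seed(0)
denoiser = ToyDenoiser().requires_grad_(False)    # the model stays frozen

# Learnable inflated motion-text embedding: multiple tokens per frame.
motion_embed = torch.randn(F_FRAMES, TOKENS_PER_FRAME, EMBED_DIM, requires_grad=True)
optimizer = torch.optim.AdamW([motion_embed], lr=1e-2)

ref_latents = torch.randn(F_FRAMES, LATENT_DIM)   # pretend-encoded reference video
image_latent = ref_latents[:1]                    # appearance comes from frame 0

for step in range(200):
    noise = torch.randn_like(ref_latents)
    t = torch.rand(())                            # toy continuous timestep
    noisy = (1 - t) * ref_latents + t * noise     # toy forward-noising
    pred = denoiser(noisy, image_latent, motion_embed)
    loss = F.mse_loss(pred, noise)                # standard denoising objective
    loss.backward()                               # gradients flow only to motion_embed
    optimizer.step()
    optimizer.zero_grad()

# At inference, the optimized motion_embed would be paired with a new target
# image's latent to generate a video with semantically similar motion.
```

Because only the embedding is trained while the model weights stay fixed, the optimized tokens capture motion rather than appearance, which is what lets them transfer to target images without spatial alignment to the reference video.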