Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

August 1, 2024
作者: Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber
cs.AI

Abstract

Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.
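
The abstract describes an optimization loop: the pre-trained image-to-video model stays frozen, and only an inflated per-frame embedding is learned by reconstructing the motion reference video under the usual diffusion noise-prediction objective. Below is a minimal PyTorch sketch of that loop under stated assumptions; all `i2v_*` helpers and the model's `add_noise`/`predict_noise`/`sample` interface are hypothetical stand-ins, not the authors' released code or any real library API.

```python
# Minimal sketch of motion-textual inversion, assuming a frozen pre-trained
# image-to-video diffusion model. Everything named i2v_* is a hypothetical
# placeholder, NOT a real API.
import torch
import torch.nn.functional as F

T, K, D = 16, 4, 1024   # frames, learnable tokens per frame, embedding dim
NUM_TIMESTEPS = 1000    # assumed diffusion schedule length

# Inflated motion-text embedding: K tokens for each of the T frames, giving
# the embedding per-frame temporal granularity over the motion.
motion_embedding = torch.nn.Parameter(0.01 * torch.randn(T, K, D))
optimizer = torch.optim.Adam([motion_embedding], lr=1e-3)

model = i2v_load_model()        # frozen pre-trained model (hypothetical)
ref_video = i2v_load_video()    # motion reference, shape (T, C, H, W)
first_frame = ref_video[:1]     # appearance is taken from the image input

for step in range(1000):
    # Standard diffusion objective: noise the reference video and predict
    # the noise, conditioning on the first frame (appearance) and on the
    # learnable embedding injected via cross-attention (motion).
    t = torch.randint(0, NUM_TIMESTEPS, (1,))
    noise = torch.randn_like(ref_video)
    noisy = model.add_noise(ref_video, noise, t)
    pred = model.predict_noise(noisy, t, image_cond=first_frame,
                               cross_attn_embed=motion_embedding)
    loss = F.mse_loss(pred, noise)  # only motion_embedding gets gradients

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Transfer: reuse the optimized embedding with a new, unaligned target image.
target_image = i2v_load_image()  # hypothetical
video = model.sample(image_cond=target_image,
                     cross_attn_embed=motion_embedding)
```

Because the model weights never change, appearance continues to come from the (latent) image input while the optimized embedding carries only the motion, which is what lets the same embedding be reapplied to arbitrary target images.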
