LivePhoto: Real Image Animation with Text-guided Motion Control

December 5, 2023
Authors: Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, Hengshuang Zhao
cs.AI

Abstract

Despite recent progress in text-to-video generation, existing studies usually overlook the fact that text controls only the spatial content of synthesized videos, not their temporal motion. To address this challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their choice with text descriptions. We first establish a strong baseline that enables a well-trained text-to-image generator (i.e., Stable Diffusion) to take an image as an additional input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link text and motion. In particular, because (1) text can describe motion only roughly (e.g., it does not specify moving speed) and (2) text may contain both content and motion descriptions, we introduce a motion intensity estimation module and a text re-weighting module to reduce the ambiguity of the text-to-motion mapping. Empirical evidence suggests that our approach decodes motion-related textual instructions into videos well, covering actions, camera movements, and even conjuring new content from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal besides text (i.e., the motion intensity) for video customization.
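The abstract does not say how motion intensity is measured or how the text re-weighting module is parameterized, so the following is only a minimal sketch under stated assumptions. The hypothetical `motion_score` uses the mean absolute pixel change between consecutive grayscale frames as a stand-in intensity measure. The hypothetical `intensity_level` quantizes that score onto an assumed 10-level scale that a user could set directly at inference time. The hypothetical `reweight_tokens` scales each text-token embedding by a sigmoid gate; in a real system, a learned network would predict the per-token scores, whereas here they are passed in by hand. None of these names or choices come from the LivePhoto paper or its code.

```python
import math
from typing import List, Sequence

# Hypothetical representation: a grayscale frame as rows of pixel values in [0, 1].
Frame = List[List[float]]


def motion_score(frames: Sequence[Frame]) -> float:
    """Mean absolute pixel change between consecutive frames.

    Assumption: a simple stand-in for a motion-intensity estimator,
    not the measure used by LivePhoto.
    """
    if len(frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for p, c in zip(row_prev, row_curr):
                total += abs(c - p)
                count += 1
    return total / count if count else 0.0


def intensity_level(score: float, max_score: float = 0.5, num_levels: int = 10) -> int:
    """Quantize a motion score into an integer level in [1, num_levels].

    Scores are clipped to [0, max_score]; both defaults are hypothetical.
    """
    clipped = min(max(score, 0.0), max_score)
    return min(num_levels, 1 + int(clipped / max_score * num_levels))


def reweight_tokens(embeddings: List[List[float]], scores: List[float]) -> List[List[float]]:
    """Scale each token embedding by sigmoid(score).

    Assumption: in practice a learned module would produce `scores`
    (e.g., to emphasize motion words); here they are given directly.
    """
    weights = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    return [[w * x for x in emb] for w, emb in zip(weights, embeddings)]


if __name__ == "__main__":
    still = [[[0.3, 0.3]], [[0.3, 0.3]]]   # two identical 1x2 frames
    moving = [[[0.0, 0.0]], [[0.25, 0.75]]]  # pixels change between frames
    print(intensity_level(motion_score(still)), intensity_level(motion_score(moving)))
```

Under this sketch, a training clip would receive a level computed from its frames as an extra conditioning input, while at inference a user could choose the level directly. Quantizing to a small integer scale keeps the control signal coarse and easy to set by hand.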