LivePhoto: Real Image Animation with Text-guided Motion Control
December 5, 2023
Authors: Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, Hengshuang Zhao
cs.AI
Abstract
Despite recent progress in text-to-video generation, existing studies
usually overlook the issue that text controls only the spatial content of
synthesized videos, not their temporal motion. To address this challenge,
this work presents a practical system, named LivePhoto, which allows users to
animate an image of their interest with text descriptions. We first establish a
strong baseline that helps a well-trained text-to-image generator (i.e., Stable
Diffusion) take an image as an additional input. We then equip the improved
generator with a motion module for temporal modeling and propose a carefully
designed training pipeline to better link texts and motions. In particular,
considering that (1) text can only describe motions roughly (e.g.,
without specifying the moving speed) and (2) text may include both content and
motion descriptions, we introduce a motion intensity estimation module as well
as a text re-weighting module to reduce the ambiguity of text-to-motion
mapping. Empirical evidence suggests that our approach decodes
motion-related textual instructions into videos well, including actions,
camera movements, and even conjuring new content from thin air (e.g., pouring
water into an empty glass). Interestingly, thanks to the proposed intensity
learning mechanism, our system offers users an additional control signal (i.e.,
the motion intensity) besides text for video customization.
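
The abstract does not say how motion intensity is computed, so the following is only an illustrative sketch of the general idea of turning a clip into a discrete intensity level. It assumes, purely for illustration, that intensity is the mean absolute difference between adjacent frames, bucketed into a fixed number of levels. The function name `motion_intensity_level` and the parameters `num_levels` and `max_score` are hypothetical and are not taken from the paper.

```python
import numpy as np


def motion_intensity_level(frames, num_levels=10, max_score=1.0):
    """Map a clip to a discrete motion-intensity level in 1..num_levels.

    Assumption (not from the paper): intensity is approximated by the mean
    absolute difference between adjacent frames.

    frames: array of shape (T, H, W, C) with pixel values in [0, 1].
    max_score: hypothetical score that maps to the highest level.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 4 or frames.shape[0] < 2:
        raise ValueError("expected at least two frames with shape (T, H, W, C)")

    # Per-pixel change between each frame and the next one.
    diffs = np.abs(frames[1:] - frames[:-1])
    score = float(diffs.mean())

    # Normalize to [0, 1], then bucket into levels 1..num_levels.
    normalized = min(score / max_score, 1.0)
    return 1 + int(np.floor(normalized * (num_levels - 1)))


# A static clip has no inter-frame change, so it gets the lowest level.
static_clip = np.zeros((4, 8, 8, 3))
print(motion_intensity_level(static_clip))
```

In a system like the one described, a level of this kind could be computed from training clips and later set directly by the user at inference time as the extra control signal alongside the text.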