Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
July 13, 2023
Authors: Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen
cs.AI
Abstract
Generating videos for visual storytelling can be a tedious and complex
process that typically requires either live-action filming or graphics
animation rendering. To bypass these challenges, our key idea is to utilize the
abundance of existing video clips and synthesize a coherent storytelling video
by customizing their appearances. We achieve this by developing a framework
comprising two functional modules: (i) Motion Structure Retrieval, which
provides video candidates with desired scene or motion context described by
query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates
plot-aligned videos under the guidance of motion structure and text prompts.
For the first module, we leverage an off-the-shelf video retrieval system and
extract video depths as motion structure. For the second module, we propose a
controllable video generation model that offers flexible controls over
structure and characters. The videos are synthesized by following the
structural guidance and appearance instructions. To ensure visual consistency
across clips, we propose an effective concept personalization approach, which
allows the specification of the desired character identities through text
prompts. Extensive experiments demonstrate that our approach exhibits
significant advantages over various existing baselines.
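The two-module design described above can be sketched as a simple per-shot pipeline: retrieve a candidate clip for each plot sentence, extract its depth as the motion structure, then synthesize a new clip conditioned on that structure, the text prompt, and a personalized character token. All function names below are hypothetical stand-ins, not the authors' actual API; the stubs only illustrate the data flow.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    text: str          # plot sentence realized by this shot
    depth: list        # per-frame depth maps acting as motion structure


def retrieve_candidates(query: str, top_k: int = 3) -> list:
    """Module (i): stand-in for an off-the-shelf text-to-video retrieval system."""
    # A real system would rank clips in a video database by text similarity;
    # here we just fabricate candidate identifiers.
    return [f"clip_{abs(hash(query)) % 1000}_{i}" for i in range(top_k)]


def extract_depth(clip_id: str) -> list:
    """Extract per-frame depth from the retrieved clip (placeholder values)."""
    return [0.0, 0.0, 0.0]  # one scalar per frame stands in for a depth map


def synthesize(structure: list, prompt: str, character_token: str) -> Clip:
    """Module (ii): structure-guided text-to-video synthesis (stub).

    character_token represents a personalized concept (e.g. a learned text
    token) that keeps the protagonist's identity consistent across clips.
    """
    return Clip(text=f"{prompt}, featuring {character_token}", depth=structure)


def animate_a_story(plot_sentences: list, character_token: str) -> list:
    """End-to-end sketch: retrieve -> extract depth -> synthesize, per shot."""
    story = []
    for sentence in plot_sentences:
        best = retrieve_candidates(sentence)[0]   # take the top candidate
        structure = extract_depth(best)
        story.append(synthesize(structure, sentence, character_token))
    return story
```

Because every shot is conditioned on the same character token, the appearance stays consistent even though each shot borrows motion structure from a different retrieved clip.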