Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

July 13, 2023
Authors: Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen
cs.AI

Abstract

Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this by developing a framework comprising two functional modules: (i) Motion Structure Retrieval, which provides video candidates with the desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters. The videos are synthesized by following the structural guidance and appearance instructions. To ensure visual consistency across clips, we propose an effective concept personalization approach, which allows the specification of the desired character identities through text prompts. Extensive experiments demonstrate that our approach exhibits significant advantages over various existing baselines.