TaleCrafter: Interactive Story Visualization with Multiple Characters
May 29, 2023
Authors: Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, Yujiu Yang
cs.AI
Abstract
Accurate story visualization requires several necessary elements, such as
identity consistency across frames, the alignment between plain text and visual
content, and a reasonable layout of objects in images. Most previous works
endeavor to meet these requirements by fitting a text-to-image (T2I) model on a
set of videos in the same style and with the same characters, e.g., the
FlintstonesSV dataset. However, the learned T2I models typically struggle to
adapt to new characters, scenes, and styles, and often lack the flexibility to
revise the layout of the synthesized images. This paper proposes a system for
generic interactive story visualization, capable of handling multiple novel
characters and supporting the editing of layout and local structure. It is
developed by leveraging the prior knowledge of large language and T2I models,
trained on massive corpora. The system comprises four interconnected
components: story-to-prompt generation (S2P), text-to-layout generation (T2L),
controllable text-to-image generation (C-T2I), and image-to-video animation
(I2V). First, the S2P module converts concise story information into detailed
prompts required for subsequent stages. Next, T2L generates diverse and
reasonable layouts based on the prompts, offering users the ability to adjust
and refine the layout to their preference. The core component, C-T2I, enables
the creation of images guided by layouts, sketches, and actor-specific
identifiers to maintain consistency and detail across visualizations. Finally,
I2V enriches the visualization process by animating the generated images.
Extensive experiments and a user study validate the effectiveness and
flexibility of the proposed system's interactive editing.
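The four-stage pipeline described above (S2P → T2L → C-T2I → I2V) can be sketched as a simple sequential dataflow. The sketch below is purely illustrative: the real system invokes a large language model for S2P and diffusion-based generators for C-T2I and I2V, whereas here each stage is a hypothetical placeholder function whose names and signatures are assumptions, not the paper's API. What it shows is the hand-off between stages and the point at which a user could edit the layout before image synthesis.

```python
from dataclasses import dataclass, field

@dataclass
class Shot:
    """One visualized story frame and its intermediate artifacts."""
    prompt: str
    layout: list = field(default_factory=list)  # bounding boxes per object
    image: dict = None                          # placeholder for a generated image
    video: list = None                          # placeholder frame sequence

def s2p(story: str) -> list:
    # Story-to-prompt: expand each terse story sentence into a detailed
    # prompt (the paper uses an LLM; a string template stands in here).
    return ["detailed scene: " + s.strip() for s in story.split(".") if s.strip()]

def t2l(prompt: str) -> list:
    # Text-to-layout: propose object boxes for the prompt. Placeholder
    # returns a single near-full-frame box; the user may edit this list.
    return [(0.1, 0.1, 0.9, 0.9)]

def c_t2i(prompt: str, layout: list, actor_ids: list) -> dict:
    # Controllable T2I: synthesize an image conditioned on the layout,
    # optional sketches, and actor-specific identifiers for consistency.
    return {"prompt": prompt, "layout": layout, "actors": actor_ids}

def i2v(image: dict) -> list:
    # Image-to-video: animate the still image into a short clip
    # (placeholder: repeat the image as 8 identical frames).
    return [image] * 8

def visualize_story(story: str, actor_ids: list) -> list:
    shots = []
    for prompt in s2p(story):
        layout = t2l(prompt)  # interactive editing would happen here
        image = c_t2i(prompt, layout, actor_ids)
        shots.append(Shot(prompt, layout, image, i2v(image)))
    return shots
```

A two-sentence story with one registered actor then yields two shots, each carrying its prompt, editable layout, image record, and animated frames:

```python
shots = visualize_story("Fred walks to the door. Fred waves goodbye.", ["fred"])
```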