AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort

Abstract

Story visualization aims to generate a series of images that match a story described in text. The generated images must be of high quality, align with the text description, and keep character identities consistent. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or by requiring users to provide per-image control conditions such as sketches. However, these simplifications make these methods inadequate for real applications. To this end, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images with minimal human interaction. Specifically, we utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, such as sketches and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module that transforms simple bounding box layouts into sketch or keypoint control conditions for final image generation, which not only improves image quality but also allows easy and intuitive user interaction. In addition, we propose a simple yet effective method to generate multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images.
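
To make the idea of a sparse control condition concrete, the following is a minimal sketch of how a bounding-box layout returned by a language model might be represented and checked before it is handed to later stages. It is not the authors' implementation: the JSON reply format, the `Box` type, and the `parse_layout` and `to_pixels` helpers are all hypothetical names chosen for illustration, and the dense condition generation and image generation stages are not shown.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """One subject's bounding box in normalized [0, 1] image coordinates (assumed convention)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def parse_layout(llm_reply: str) -> list[Box]:
    """Parse a JSON layout (hypothetical format) and reject malformed boxes."""
    boxes = [Box(**item) for item in json.loads(llm_reply)]
    for b in boxes:
        # Each box must lie inside the image and have positive width and height.
        if not (0.0 <= b.x0 < b.x1 <= 1.0 and 0.0 <= b.y0 < b.y1 <= 1.0):
            raise ValueError(f"invalid box for {b.label!r}")
    return boxes


def to_pixels(box: Box, width: int, height: int) -> tuple[int, int, int, int]:
    """Scale a normalized box to integer pixel coordinates (x0, y0, x1, y1)."""
    return (
        round(box.x0 * width),
        round(box.y0 * height),
        round(box.x1 * width),
        round(box.y1 * height),
    )


# Example reply, as an assumed stand-in for what a layout-planning prompt could return.
reply = '[{"label": "cat", "x0": 0.1, "y0": 0.5, "x1": 0.4, "y1": 0.9}]'
layout = parse_layout(reply)
pixel_box = to_pixels(layout[0], 512, 512)
```

A layout like this is easy for a user to inspect or adjust by hand, which is one reason a box-level representation suits the planning stage described above.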
