AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort
November 19, 2023
Authors: Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen
cs.AI
Abstract
Story visualization aims to generate a series of images that match a story
described in text, requiring the generated images to be of high quality,
aligned with the text description, and consistent in character identity.
Given the complexity of story visualization, existing methods
drastically simplify the problem by considering only a few specific characters
and scenarios, or requiring the users to provide per-image control conditions
such as sketches. However, these simplifications render these methods
inadequate for real applications. To address this, we propose an automated
story visualization system that can effectively generate diverse,
high-quality, and consistent sets of story images with minimal human
interaction. Specifically,
we utilize the comprehension and planning capabilities of large language models
for layout planning, and then leverage large-scale text-to-image models to
generate sophisticated story images based on the layout. We empirically find
that sparse control conditions, such as bounding boxes, are suitable for layout
planning, while dense control conditions, e.g., sketches and keypoints, are
suitable for generating high-quality image content. To obtain the best of both
worlds, we devise a dense condition generation module to transform simple
bounding box layouts into sketch or keypoint control conditions for final image
generation, which not only improves the image quality but also allows easy and
intuitive user interactions. In addition, we propose a simple yet effective
method to generate multi-view consistent character images, eliminating the
reliance on human labor to collect or draw character images.
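The abstract describes a three-stage pipeline: an LLM plans sparse bounding-box layouts, a dense condition generation module converts the boxes into sketch or keypoint controls, and a text-to-image model renders each panel conditioned on those controls plus multi-view-consistent character images. A minimal sketch of that flow, with stub functions standing in for the real models (all names, data shapes, and the `character_bank` are illustrative assumptions, not the authors' implementation):

```python
# Hypothetical sketch of the AutoStory pipeline described in the abstract.
# Every function here is a stand-in for a learned model; names and data
# structures are assumptions for illustration only.

def plan_layout(story_text, panel_count):
    """Stage 1 (stand-in): an LLM would decompose the story into per-panel
    prompts plus sparse bounding-box layouts (normalized x0, y0, x1, y1)."""
    return [
        {"prompt": f"{story_text} (panel {i + 1})",
         "boxes": [{"character": "hero", "bbox": (0.1, 0.2, 0.4, 0.9)}]}
        for i in range(panel_count)
    ]

def densify_conditions(panel):
    """Stage 2 (stand-in): the dense condition generation module would turn
    each coarse box into a dense control, e.g. a sketch or keypoint map."""
    return [{"character": b["character"], "condition": "sketch",
             "bbox": b["bbox"]} for b in panel["boxes"]]

def render_panel(panel, dense_conditions, character_bank):
    """Stage 3 (stand-in): a large text-to-image model would render the
    panel, conditioned on the dense controls and on multi-view-consistent
    reference images of each character."""
    chars = sorted({c["character"] for c in dense_conditions})
    return {"prompt": panel["prompt"],
            "characters": [character_bank[c] for c in chars]}

def autostory(story_text, panel_count, character_bank):
    """End-to-end: layout planning -> dense conditions -> image generation."""
    panels = plan_layout(story_text, panel_count)
    return [render_panel(p, densify_conditions(p), character_bank)
            for p in panels]

panels = autostory("A knight befriends a dragon", 3,
                   {"hero": "hero_refs.png"})
print(len(panels))  # one rendered panel per story beat
```

The split mirrors the paper's empirical finding: sparse boxes are easy for the LLM to plan (and for users to edit), while the dense conditions they are expanded into are what the image model needs for high-quality content.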