AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort

Abstract

Story visualization aims to generate a series of images that match a story described in text. The generated images must be of high quality, align with the text description, and keep character identities consistent. Given the complexity of the task, existing methods drastically simplify the problem, either by considering only a few specific characters and scenarios or by requiring users to provide per-image control conditions such as sketches. These simplifications make such methods unsuitable for real applications. To address this, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images with minimal human interaction. Specifically, we use the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, such as sketches and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module that transforms simple bounding box layouts into sketch or keypoint control conditions for final image generation, which not only improves image quality but also allows easy and intuitive user interaction. In addition, we propose a simple yet effective method for generating multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images.
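The sparse-to-dense data flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the JSON reply format, the names `Box`, `parse_layout`, `to_pixels`, and `plan_panel`, and the 512x512 canvas are all assumptions made for illustration. The sketch covers only the sparse layout step (parsing and sanitizing bounding boxes that a language model might return); the dense condition generator and the text-to-image model are deliberately left out.

```python
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class Box:
    """One sparse layout entry: a labelled box in normalized [0, 1] coordinates."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def parse_layout(llm_reply: str) -> list[Box]:
    """Parse a hypothetical JSON layout reply and clamp each box to the canvas."""
    boxes = []
    for item in json.loads(llm_reply):
        # Clamp every coordinate into [0, 1] so boxes never leave the canvas.
        x0, y0, x1, y1 = (min(max(float(v), 0.0), 1.0) for v in item["box"])
        if x1 > x0 and y1 > y0:  # drop boxes with zero width or height
            boxes.append(Box(item["label"], x0, y0, x1, y1))
    return boxes


def to_pixels(box: Box, width: int, height: int) -> tuple[int, int, int, int]:
    """Convert a normalized box to integer pixel coordinates."""
    return (round(box.x0 * width), round(box.y0 * height),
            round(box.x1 * width), round(box.y1 * height))


def plan_panel(llm_reply: str, width: int = 512, height: int = 512) -> dict:
    """Map each object label to its pixel region for one story panel.

    In a full system each region would be passed to a dense condition
    generator (sketch or keypoints) and then to a text-to-image model.
    Neither stage is implemented here.
    """
    return {b.label: to_pixels(b, width, height) for b in parse_layout(llm_reply)}
```

Calling `plan_panel` on a reply string yields one pixel-space region per valid object, which is the kind of per-object input a dense condition generation step would consume.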
