TaleCrafter: Interactive Story Visualization with Multiple Characters

Abstract

Accurate story visualization requires several elements: identity consistency across frames, alignment between the plain text and the visual content, and a reasonable layout of objects in each image. Most previous works try to meet these requirements by fitting a text-to-image (T2I) model on a set of videos in the same style and with the same characters, e.g., the FlintstonesSV dataset. However, the learned T2I models typically struggle to adapt to new characters, scenes, and styles, and often lack the flexibility to revise the layout of the synthesized images. This paper proposes a system for generic interactive story visualization that can handle multiple novel characters and supports editing of layout and local structure. It is developed by leveraging the prior knowledge of large language models and T2I models trained on massive corpora. The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V). First, the S2P module converts concise story information into the detailed prompts required by the subsequent stages. Next, T2L generates diverse and reasonable layouts based on the prompts and lets users adjust and refine the layout to their preference. The core component, C-T2I, creates images guided by layouts, sketches, and actor-specific identifiers to maintain consistency and detail across visualizations. Finally, I2V enriches the visualization by animating the generated images. Extensive experiments and a user study are conducted to validate the effectiveness of the proposed system and the flexibility of its interactive editing.
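
The data flow between the four components can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name, type, and stub body below is a hypothetical placeholder chosen for illustration, and the real stages rely on a large language model, a layout generator, a controllable diffusion-based T2I model, and an animation module. The sketch also simplifies C-T2I by omitting sketch and actor-identifier inputs. It only shows the order of the stages and where a user's layout edit would plug in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# All names here are hypothetical placeholders, not the authors' API.
Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)
Layout = List[Tuple[str, Box]]           # (character name, bounding box)


@dataclass
class Frame:
    prompt: str
    layout: Layout = field(default_factory=list)
    image: str = ""  # stand-in for image data
    video: str = ""  # stand-in for an animated clip


def story_to_prompts(story: str) -> List[str]:
    """S2P stand-in: one prompt per sentence (the real module uses an LLM)."""
    return [s.strip() for s in story.split(".") if s.strip()]


def text_to_layout(prompt: str, characters: List[str]) -> Layout:
    """T2L stand-in: give each mentioned character an equal-width column."""
    present = [c for c in characters if c in prompt]
    n = len(present)
    return [(c, (i / n, 0.2, (i + 1) / n, 0.9)) for i, c in enumerate(present)]


def controllable_t2i(prompt: str, layout: Layout) -> str:
    """C-T2I stand-in: returns a label instead of rendering an image."""
    return f"image<{prompt}|{len(layout)} boxes>"


def image_to_video(image: str) -> str:
    """I2V stand-in: returns a label instead of animating the image."""
    return f"video<{image}>"


def visualize(
    story: str,
    characters: List[str],
    edit_layout: Optional[Callable[[Layout], Layout]] = None,
) -> List[Frame]:
    """Run S2P -> T2L -> (optional user edit) -> C-T2I -> I2V per prompt."""
    frames = []
    for prompt in story_to_prompts(story):
        layout = text_to_layout(prompt, characters)
        if edit_layout is not None:
            layout = edit_layout(layout)  # interactive adjustment hook
        image = controllable_t2i(prompt, layout)
        frames.append(Frame(prompt, layout, image, image_to_video(image)))
    return frames
```

Calling `visualize("Tom meets Ann. Ann waves", ["Tom", "Ann"])` yields one `Frame` per sentence. Passing an `edit_layout` callback models the interactive step in which a user adjusts the generated layout before image synthesis.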
