Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Abstract

Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to exploit the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearance. We achieve this with a framework composed of two functional modules: (i) Motion Structure Retrieval, which provides video candidates with the desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depth as the motion structure. For the second module, we propose a controllable video generation model that offers flexible control over structure and characters. Videos are synthesized by following the structural guidance and the appearance instructions. To ensure visual consistency across clips, we propose an effective concept personalization approach, which allows the desired character identities to be specified through text prompts. Extensive experiments demonstrate that our approach exhibits significant advantages over various existing baselines.
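The data flow between the two modules can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: every name here (`Clip`, `Shot`, `retrieve`, `extract_depth`, `synthesize`, `make_story`) is hypothetical, the word-overlap retrieval is a toy stand-in for the off-the-shelf retrieval system, and the depth and synthesis steps are placeholders marking where real models would run. Appending the character name to the prompt is likewise a simplification of the concept personalization step.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    caption: str  # text description used for retrieval (assumed to exist)


@dataclass
class Shot:
    structure_from: str  # which motion structure guided this shot
    prompt: str          # text prompt carrying appearance and character


def retrieve(query: str, database: list[Clip]) -> Clip:
    """Toy stand-in for module (i): pick the clip whose caption
    shares the most words with the query text."""
    query_words = set(query.lower().split())
    return max(
        database,
        key=lambda clip: len(query_words & set(clip.caption.lower().split())),
    )


def extract_depth(clip: Clip) -> str:
    """Placeholder: a real system would estimate depth for every frame."""
    return f"depth({clip.clip_id})"


def synthesize(structure: str, prompt: str) -> Shot:
    """Placeholder for module (ii): a real model would render frames here."""
    return Shot(structure_from=structure, prompt=prompt)


def make_story(plot: list[tuple[str, str]], database: list[Clip],
               character: str) -> list[Shot]:
    """For each (query, prompt) pair in the plot: retrieve a clip,
    take its depth as motion structure, then synthesize the shot."""
    shots = []
    for query, prompt in plot:
        clip = retrieve(query, database)
        structure = extract_depth(clip)
        # Reusing one character description in every prompt mimics,
        # very loosely, keeping the identity consistent across clips.
        shots.append(synthesize(structure, f"{prompt}, featuring {character}"))
    return shots
```

Calling `make_story` with a small list of captioned clips and a list of (query, prompt) pairs returns one `Shot` per plot step, each recording which retrieved clip supplied the motion structure.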
