DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

Abstract

Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in an input text script. SVG holds great potential for diverse content creation in media and entertainment, but it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method. First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. Next, DreamRunner applies retrieval-augmented test-time adaptation to capture target motion priors for the objects in each scene, supporting diverse motion customization based on retrieved videos and thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module (SR3AI) for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines and demonstrate state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.
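To make the idea of a coarse scene plan with fine-grained, per-object regions more concrete, the following is a minimal sketch only. The abstract does not specify a plan schema or how SR3AI consumes regions, so every name here (`RegionPlan`, `ScenePlan`, `region_mask`), the normalized box format, and the grid-based mask are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RegionPlan:
    """Hypothetical fine-grained entry: one entity, one motion, one region."""
    entity: str
    motion: str
    start_frame: int  # inclusive
    end_frame: int    # exclusive
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass
class ScenePlan:
    """Hypothetical coarse entry: a scene description plus its region plans."""
    description: str
    num_frames: int
    regions: List[RegionPlan] = field(default_factory=list)


def region_mask(scene: ScenePlan, region: RegionPlan, height: int, width: int):
    """Return mask[t][y][x], True where a grid cell's centre lies inside the
    region's box while the region is active (assumed spatial-temporal layout)."""
    x0, y0, x1, y1 = region.box
    mask = []
    for t in range(scene.num_frames):
        active = region.start_frame <= t < region.end_frame
        frame = [
            [
                active
                and x0 <= (x + 0.5) / width < x1
                and y0 <= (y + 0.5) / height < y1
                for x in range(width)
            ]
            for y in range(height)
        ]
        mask.append(frame)
    return mask


# Example: one scene in which the same subject performs two motions in sequence.
scene = ScenePlan(
    description="A dog plays in a park",
    num_frames=4,
    regions=[
        RegionPlan("dog", "runs", 0, 2, (0.0, 0.5, 0.5, 1.0)),
        RegionPlan("dog", "jumps", 2, 4, (0.5, 0.0, 1.0, 1.0)),
    ],
)
run_mask = region_mask(scene, scene.regions[0], height=2, width=2)
```

In a region-based attention scheme, a mask of this kind could be used to limit which visual tokens attend to the text describing a given entity and motion. Whether DreamRunner builds or applies its regions this way is not stated in the abstract.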
