AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

Abstract

Despite rapid advances in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, which produces disjointed narratives and pacing problems. Furthermore, because video generation models are inherently unstable, even a single low-quality clip can significantly degrade the logical coherence and visual continuity of the entire output animation. To overcome these obstacles, we introduce AniMaker, a multi-agent framework that enables efficient multi-candidate clip generation and storytelling-aware clip selection, creating globally consistent and story-coherent animation from text input alone. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components. MCTS-Gen, in the Photography Agent, is an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage. AniEval, in the Reviewer Agent, is the first framework specifically designed for multi-shot animation evaluation; it assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics, including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
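
The abstract does not spell out how MCTS-Gen allocates its generation budget, so the following is only a generic illustration of the underlying idea, not the authors' algorithm. It is a minimal sketch that assumes a fixed budget of follow-up generations is spread across a few candidate clips with the standard UCB1 rule, and that a reviewer returns a score in [0, 1]. The names `Node`, `select_clip`, `generate_next`, and `evaluate`, and the quality values in `demo`, are all hypothetical.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate clip (hypothetical structure, not from the paper)."""
    clip_id: str
    visits: int = 0             # number of follow-up clips generated from this one
    total_reward: float = 0.0   # sum of reviewer scores for those follow-ups
    children: list = field(default_factory=list)


def ucb1(node: Node, total_steps: int, c: float = 1.4) -> float:
    """Standard UCB1: mean reward plus an exploration bonus."""
    if node.visits == 0:
        return math.inf  # try every candidate at least once
    mean = node.total_reward / node.visits
    return mean + c * math.sqrt(math.log(total_steps) / node.visits)


def select_clip(candidates, generate_next, evaluate, budget, c=1.4):
    """Spend `budget` follow-up generations across `candidates`, guided by UCB1."""
    for step in range(1, budget + 1):
        node = max(candidates, key=lambda n: ucb1(n, step, c))
        child = generate_next(node)        # stand-in for a video-model call
        reward = evaluate(node, child)     # stand-in for a reviewer score
        node.children.append(child)
        node.visits += 1
        node.total_reward += reward
    # Keep the candidate whose follow-ups scored best on average.
    return max(candidates,
               key=lambda n: n.total_reward / n.visits if n.visits else 0.0)


def demo(seed=0, budget=12):
    """Toy run with made-up qualities; no real video model is involved."""
    rng = random.Random(seed)
    quality = {"a": 0.2, "b": 0.5, "c": 0.9}   # assumed hidden clip quality
    candidates = [Node(clip_id=k) for k in quality]

    def generate_next(node):
        return Node(clip_id=f"{node.clip_id}-next{node.visits}")

    def evaluate(node, child):
        # Noisy score centred on the parent's assumed quality.
        return quality[node.clip_id] + rng.uniform(-0.05, 0.05)

    best = select_clip(candidates, generate_next, evaluate, budget)
    return best, candidates


if __name__ == "__main__":
    best, candidates = demo()
    print(best.clip_id, [(n.clip_id, n.visits) for n in candidates])
```

In this toy setup the exploration bonus shrinks as a candidate accumulates visits, so the budget tends to drift toward candidates with higher observed scores while weaker ones are still sampled occasionally. A real system would replace the two stub functions with a video generator and an evaluator, and would likely search deeper than one level.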
