AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

Abstract

Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. Producing engaging animations requires generating coherent multi-shot video clips that follow a narrative script and stay faithful to character references. However, existing public datasets focus primarily on real-world scenarios, provide mostly global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. Built with an automated pipeline, AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images. Shot-level annotations decompose the story into consecutive shots, each annotated with its scene, its characters, and both a narrative and a descriptive visual caption. Additionally, a dedicated subset, AnimeShooter-audio, offers a synchronized audio track for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by an MLLM to produce representations that are aware of both the reference and the context. These representations then serve as the condition for the diffusion model, which decodes the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, highlighting the value of our dataset for coherent animated video generation.
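
As an illustration only, the two annotation levels and the shot-by-shot generation loop described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the released schema or the AnimeShooterGen implementation: every class name, field name, and function name here is hypothetical, the MLLM and the diffusion model are stand-in callables, and passing the current shot's annotation into the encoder is our assumption rather than something the abstract states.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class CharacterProfile:
    # Hypothetical fields: a main character and its reference image.
    name: str
    reference_image: str  # e.g. a file path; the storage format is assumed


@dataclass
class Shot:
    # Shot-level annotation: scene, characters, and two kinds of caption.
    scene: str
    characters: List[str]
    narrative_caption: str
    descriptive_caption: str
    # Present only for the AnimeShooter-audio subset (assumed optional).
    audio_track: Optional[str] = None
    audio_description: Optional[str] = None
    sound_source: Optional[str] = None


@dataclass
class Story:
    # Story-level annotation: storyline, key scenes, main character profiles.
    storyline: str
    key_scenes: List[str]
    main_characters: List[CharacterProfile]
    shots: List[Shot] = field(default_factory=list)


def generate_shots(
    story: Story,
    reference_image: str,
    encode_context: Callable[[str, List[Any], Shot], Any],  # stand-in for the MLLM
    decode_shot: Callable[[Any], Any],  # stand-in for the video diffusion model
) -> List[Any]:
    """Generate shots one by one, each conditioned on the reference image
    and on all previously generated shots."""
    generated: List[Any] = []
    for shot in story.shots:
        # The encoder sees the reference plus a copy of the shots produced so far.
        condition = encode_context(reference_image, list(generated), shot)
        generated.append(decode_shot(condition))
    return generated
```

The loop is autoregressive in the sense that each call to `encode_context` receives the outputs of all earlier iterations, which is how context from earlier shots can carry over to later ones in this sketch.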
