

AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

June 3, 2025
Authors: Lu Qiu, Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu
cs.AI

Abstract

Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with its scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by the MLLM to produce representations aware of both the reference and the context, which then condition the diffusion model to decode the subsequent shot. Experimental results show that a model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, highlighting the value of our dataset for coherent animated video generation.
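The abstract describes a hierarchical annotation scheme (story-level and shot-level) and an autoregressive generation loop in which an MLLM conditions a video diffusion model on the reference image and previously generated shots. The following Python sketch is only an illustration of that structure under stated assumptions: the dataclass field names and the `mllm_encode` / `diffusion_decode` callables are hypothetical placeholders, not the dataset's actual keys or the paper's released code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

# Hypothetical annotation hierarchy mirroring the abstract's description:
# story-level (storyline, key scenes, character profiles with reference images)
# and shot-level (scene, characters, narrative and descriptive captions).

@dataclass
class CharacterProfile:
    name: str
    description: str
    reference_image: str          # path to the character's reference image

@dataclass
class ShotAnnotation:
    scene: str                    # scene this shot belongs to
    characters: List[str]         # characters appearing in the shot
    narrative_caption: str        # story-oriented caption
    descriptive_caption: str      # visual-appearance caption

@dataclass
class StoryAnnotation:
    storyline: str
    key_scenes: List[str]
    main_characters: List[CharacterProfile]
    shots: List[ShotAnnotation] = field(default_factory=list)


def generate_multi_shot(
    story: StoryAnnotation,
    reference_image: Any,
    mllm_encode: Callable[..., Any],
    diffusion_decode: Callable[[Any], Any],
) -> List[Any]:
    """Sketch of an AnimeShooterGen-style loop: each new shot is decoded from a
    condition that is aware of both the reference image and the shots generated
    so far, which is what keeps characters visually consistent across shots."""
    generated_shots: List[Any] = []
    for shot in story.shots:
        condition = mllm_encode(
            reference=reference_image,
            previous_shots=generated_shots,
            caption=shot.descriptive_caption,
        )
        next_shot = diffusion_decode(condition)
        generated_shots.append(next_shot)
    return generated_shots
```

The loop is sequential by design: because each shot's condition includes all earlier outputs, cross-shot consistency is enforced at generation time rather than by post-hoc alignment.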
