AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
June 3, 2025
Authors: Lu Qiu, Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu
cs.AI
Abstract
Recent advances in AI-generated content (AIGC) have significantly accelerated
animation production. To produce engaging animations, it is essential to
generate coherent multi-shot video clips with narrative scripts and character
references. However, existing public datasets primarily focus on real-world
scenarios with global descriptions, and lack reference images for consistent
character guidance. To bridge this gap, we present AnimeShooter, a
reference-guided multi-shot animation dataset. AnimeShooter features
comprehensive hierarchical annotations and strong visual consistency across
shots through an automated pipeline. Story-level annotations provide an
overview of the narrative, including the storyline, key scenes, and main
character profiles with reference images, while shot-level annotations
decompose the story into consecutive shots, each annotated with scene,
characters, and both narrative and descriptive visual captions. Additionally, a
dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each
shot, along with audio descriptions and sound sources. To demonstrate the
effectiveness of AnimeShooter and establish a baseline for the reference-guided
multi-shot video generation task, we introduce AnimeShooterGen, which leverages
Multimodal Large Language Models (MLLMs) and video diffusion models. The
reference image and previously generated shots are first processed by the MLLM to
produce representations aware of both reference and context, which are then
used as the condition for the diffusion model to decode the subsequent shot.
Experimental results show that the model trained on AnimeShooter achieves
superior cross-shot visual consistency and adherence to reference visual
guidance, highlighting the value of our dataset for coherent animated video
generation.
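
To make the hierarchical annotation structure concrete, the following is a minimal sketch of how a story-level record with nested shot-level entries could be organized. All class and field names (StoryAnnotation, ShotAnnotation, narrative_caption, etc.) are illustrative assumptions for this sketch, not the dataset's actual schema or file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CharacterProfile:
    # Hypothetical fields: main-character profile with a reference image
    name: str
    description: str
    reference_image: str            # path to the character reference image

@dataclass
class ShotAnnotation:
    # Hypothetical fields: one consecutive shot within the story
    scene: str                      # scene in which the shot takes place
    characters: List[str]           # characters appearing in the shot
    narrative_caption: str          # story-oriented caption
    descriptive_caption: str        # appearance-oriented visual caption
    audio_path: Optional[str] = None         # AnimeShooter-audio subset only
    audio_description: Optional[str] = None  # description of the audio track
    sound_source: Optional[str] = None       # where the sound originates

@dataclass
class StoryAnnotation:
    # Story-level overview: storyline, key scenes, main characters, shots
    storyline: str
    key_scenes: List[str]
    main_characters: List[CharacterProfile]
    shots: List[ShotAnnotation] = field(default_factory=list)
```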
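Similarly, a minimal sketch of the autoregressive generation loop described for AnimeShooterGen is shown below. The interfaces mllm.encode and diffusion.sample are placeholders assumed for illustration; the abstract does not specify the actual model APIs.

```python
def generate_shots(mllm, diffusion, reference_image, shot_captions):
    """Autoregressive reference-guided multi-shot generation (sketch).

    The MLLM fuses the character reference image and previously generated
    shots into a reference- and context-aware conditioning representation;
    the video diffusion model then decodes the subsequent shot from it.
    """
    generated_shots = []
    for caption in shot_captions:
        # Condition on the reference image plus all shots generated so far
        condition = mllm.encode(
            reference=reference_image,   # keeps character appearance consistent
            context=generated_shots,     # provides cross-shot visual context
            caption=caption,             # shot-level narrative/visual caption
        )
        # Decode the next shot with the video diffusion model
        shot = diffusion.sample(condition)
        generated_shots.append(shot)
    return generated_shots
```

Conditioning each new shot on both the reference image and all previously decoded shots is what allows the pipeline to maintain character identity and cross-shot visual consistency as the story unfolds.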