AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

Abstract

Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots, both obtained through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by the MLLM to produce representations that are aware of both the reference and the context; these representations are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, highlighting the value of our dataset for coherent animated video generation.
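The two-level annotation hierarchy and the shot-by-shot generation loop described above can be made concrete with a minimal Python sketch. Everything named here is an assumption for illustration: the class and field names are hypothetical and do not reflect the dataset's released schema, and `encode` and `decode` are placeholder callables standing in for the MLLM and the video diffusion model, not the AnimeShooterGen API. The sketch also assumes one reference image per main character.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class CharacterProfile:
    """Main character profile (story level). Field names are hypothetical."""
    name: str
    description: str
    reference_image: str  # path or URL of the reference image (assumed form)


@dataclass
class ShotAnnotation:
    """Shot-level annotation. Field names are hypothetical."""
    scene: str
    characters: List[str]
    narrative_caption: str
    visual_caption: str
    # Only shots in the AnimeShooter-audio subset would fill these in.
    audio_track: Optional[str] = None
    audio_description: Optional[str] = None
    sound_source: Optional[str] = None


@dataclass
class StoryAnnotation:
    """Story-level annotation that owns its consecutive shots."""
    storyline: str
    key_scenes: List[str]
    main_characters: List[CharacterProfile]
    shots: List[ShotAnnotation] = field(default_factory=list)


def generate_shots(
    story: StoryAnnotation,
    encode: Callable[[List[str], List[Any], ShotAnnotation], Any],
    decode: Callable[[Any], Any],
) -> List[Any]:
    """Generate shots one at a time.

    `encode` stands in for the MLLM: it sees the reference images, the shots
    generated so far, and the current shot annotation, and returns a
    conditioning representation. `decode` stands in for the diffusion model
    that turns that representation into the next shot.
    """
    references = [c.reference_image for c in story.main_characters]
    generated: List[Any] = []
    for shot in story.shots:
        # Pass a copy so the encoder cannot mutate the running history.
        condition = encode(references, list(generated), shot)
        generated.append(decode(condition))
    return generated
```

Because each call to `encode` receives the full list of previously generated shots, the history grows by one shot per iteration, which is the property the abstract relies on for cross-shot consistency. A real system would likely cap or compress that history; the sketch does not model this.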
