AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

Abstract

Despite rapid advances in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, because video generation models are inherently unstable, even a single low-quality clip can significantly degrade the logical coherence and visual continuity of the entire animation. To overcome these obstacles, we introduce AniMaker, a multi-agent framework that enables efficient multi-candidate clip generation and storytelling-aware clip selection, creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in the Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in the Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics, including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
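The abstract does not spell out how MCTS-Gen works, so the following is only a minimal sketch of a generic MCTS-style search over candidate clips, not the authors' algorithm. It assumes each shot admits a small number of candidate takes, that a reviewer score can be computed for a clip given the clip before it, and that UCB1 decides where the generation budget is spent. All names (`search_clips`, `toy_generate`, `toy_review`, `width`, `iterations`) are hypothetical, and the toy generator and reviewer merely stand in for a video model and an AniEval-like scorer.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    clip: Optional[str]              # hypothetical clip handle; None at the root
    depth: int                       # number of shots fixed along this path
    score: float = 0.0               # reviewer score of this clip in context
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total: float = 0.0


def ucb(child: Node, c: float) -> float:
    """UCB1: mean reward plus a bonus for rarely visited candidates."""
    bonus = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return child.total / child.visits + bonus


def search_clips(
    num_shots: int,
    generate: Callable[[int, Optional[str], int], str],
    review: Callable[[str, Optional[str]], float],
    iterations: int,
    width: int = 2,
    c: float = 1.0,
) -> List[str]:
    root = Node(clip=None, depth=0)
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded, non-terminal nodes.
        while node.depth < num_shots and len(node.children) >= width:
            node = max(node.children, key=lambda ch: ucb(ch, c))
        # Expansion: generate one new candidate clip for the next shot.
        if node.depth < num_shots:
            clip = generate(node.depth, node.clip, len(node.children))
            child = Node(clip, node.depth + 1, review(clip, node.clip), parent=node)
            node.children.append(child)
            node = child
        # Backpropagation: credit the clip's score to every ancestor.
        reward = node.score
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    # Final choice: follow the most-visited child at each level.
    # The path may be shorter than num_shots if the budget was too small.
    path, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        path.append(node.clip)
    return path


def toy_generate(shot: int, prev: Optional[str], take: int) -> str:
    """Stand-in for a video model: returns a label instead of a clip."""
    return f"shot{shot}-take{take}"


def toy_review(clip: str, prev: Optional[str]) -> float:
    """Stand-in for a context-aware reviewer: odd takes score higher,
    with a small bonus when the take number matches the previous clip."""
    take = clip.rsplit("take", 1)[1]
    quality = 0.8 if int(take) % 2 else 0.3
    continuity = 0.1 if prev is not None and prev.endswith("take" + take) else 0.0
    return quality + continuity


storyboard = search_clips(3, toy_generate, toy_review, iterations=40)
print(storyboard)
```

In this sketch the budget is counted in search iterations rather than generation calls, and the reviewer only looks at the preceding clip. Scoring against succeeding clips as well, as the abstract describes for AniEval, would require re-scoring a clip once its successors exist.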
