FairyGen: Storied Cartoon Video from a Single Child-Drawn Character

Abstract

We propose FairyGen, an automatic system that generates story-driven cartoon videos from a single child's drawing while faithfully preserving its unique artistic style. Unlike previous storytelling methods, which focus primarily on character consistency and basic motion, FairyGen explicitly disentangles character modeling from stylized background generation and incorporates cinematic shot design to support expressive and coherent storytelling. Given a single character sketch, we first employ a multimodal large language model (MLLM) to generate a structured storyboard with shot-level descriptions that specify environment settings, character actions, and camera perspectives. To ensure visual consistency, we introduce a style propagation adapter that captures the character's visual style and applies it to the background, synthesizing style-consistent scenes while faithfully retaining the character's full visual identity. A shot design module further enhances visual diversity and cinematic quality through frame cropping and multi-view synthesis based on the storyboard. To animate the story, we reconstruct a 3D proxy of the character to derive physically plausible motion sequences, which are then used to fine-tune an MMDiT-based image-to-video diffusion model. We further propose a two-stage motion customization adapter: the first stage learns appearance features from temporally unordered frames, disentangling identity from motion; the second stage models temporal dynamics using a timestep-shift strategy with frozen identity weights. Once trained, FairyGen directly renders diverse and coherent video scenes aligned with the storyboard. Extensive experiments demonstrate that our system produces animations that are stylistically faithful and narratively structured, with natural motion, highlighting its potential for personalized and engaging story animation. The code will be available at https://github.com/GVCLab/FairyGen
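
To make the storyboard and shot-design steps more concrete, the following is a minimal sketch of how shot-level descriptions (environment, character action, camera perspective) might be represented and turned into crop regions. It is an illustration under stated assumptions, not FairyGen's actual implementation: the `Shot` fields, the `CROP_SCALE` values, and the `crop_box` helper are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Shot:
    """One storyboard entry. Field names are illustrative, not FairyGen's schema."""
    environment: str  # e.g. "sunny meadow"
    action: str       # e.g. "walking"
    camera: str       # assumed vocabulary: "wide", "medium", "close-up"


# Hypothetical fraction of the full frame kept for each camera perspective.
CROP_SCALE = {"wide": 1.0, "medium": 0.75, "close-up": 0.5}


def crop_box(width: int, height: int,
             center: Tuple[int, int], camera: str) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for a crop centred on the character,
    shifted where necessary so the box stays inside the frame."""
    scale = CROP_SCALE[camera]
    w, h = round(width * scale), round(height * scale)
    left = min(max(center[0] - w // 2, 0), width - w)
    top = min(max(center[1] - h // 2, 0), height - h)
    return left, top, left + w, top + h


# A two-shot storyboard of the kind an MLLM might be asked to produce.
storyboard = [
    Shot("sunny meadow", "walking", "wide"),
    Shot("sunny meadow", "waving", "close-up"),
]

# Character assumed to stand at pixel (800, 300) in a 1024x576 frame.
boxes = [crop_box(1024, 576, (800, 300), shot.camera) for shot in storyboard]
```

In this sketch a wide shot keeps the whole frame, while a close-up keeps half of each dimension and is clamped at the right edge because the assumed character position is near the border. Multi-view synthesis, which the abstract also attributes to the shot design module, is not covered here.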
