FairyGen: Storied Cartoon Video from a Single Child-Drawn Character
June 26, 2025
Authors: Jiayi Zheng, Xiaodong Cun
cs.AI
Abstract
We propose FairyGen, an automatic system for generating story-driven cartoon
videos from a single child's drawing, while faithfully preserving its unique
artistic style. Unlike previous storytelling methods that primarily focus on
character consistency and basic motion, FairyGen explicitly disentangles
character modeling from stylized background generation and incorporates
cinematic shot design to support expressive and coherent storytelling. Given a
single character sketch, we first employ an MLLM to generate a structured
storyboard with shot-level descriptions that specify environment settings,
character actions, and camera perspectives. To ensure visual consistency, we
introduce a style propagation adapter that captures the character's visual
style and applies it to the background, faithfully retaining the character's
full visual identity while synthesizing style-consistent scenes. A shot design
module further enhances visual diversity and cinematic quality through frame
cropping and multi-view synthesis based on the storyboard. To animate the
story, we reconstruct a 3D proxy of the character to derive physically
plausible motion sequences, which are then used to fine-tune an MMDiT-based
image-to-video diffusion model. We further propose a two-stage motion
customization adapter: the first stage learns appearance features from
temporally unordered frames, disentangling identity from motion; the second
stage models temporal dynamics using a timestep-shift strategy with frozen
identity weights. Once trained, FairyGen directly renders diverse and coherent
video scenes aligned with the storyboard. Extensive experiments demonstrate
that our system produces animations that are stylistically faithful and
narratively structured, with natural motion, highlighting its potential for
personalized and engaging story animation. The code will be available at
https://github.com/GVCLab/FairyGen.
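
To make the storyboard stage concrete, below is a minimal sketch of what a shot-level entry could look like, with the three fields the abstract names (environment setting, character action, camera perspective). The schema and field names are hypothetical illustrations, not FairyGen's actual output format.

```python
# Hypothetical shot-level storyboard entries. Field names are illustrative
# assumptions, not FairyGen's actual schema; each shot specifies an
# environment, a character action, and a camera perspective.
storyboard = [
    {
        "shot_id": 1,
        "environment": "moonlit forest clearing, rendered in the child's crayon style",
        "character_action": "the character tiptoes toward a glowing mushroom",
        "camera": "medium shot, slight low angle, slow push-in",
    },
    {
        "shot_id": 2,
        "environment": "same clearing, fireflies drifting upward",
        "character_action": "the character reaches out and touches the mushroom",
        "camera": "close-up, shallow depth of field",
    },
]
```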
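
The two-stage motion customization adapter can also be sketched in code. The snippet below is a minimal PyTorch illustration under stated assumptions: `ToyVideoDiT` is a toy stand-in for the MMDiT-based backbone, the squared-norm loss is a placeholder for the actual diffusion objective, and `shift_timesteps` uses a common flow-matching-style shift formula t' = s*t / (1 + (s-1)*t) as an assumed instantiation of the paper's timestep-shift strategy.

```python
import torch
import torch.nn as nn

class ToyVideoDiT(nn.Module):
    """Toy stand-in for an MMDiT-based image-to-video backbone."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.identity_adapter = nn.Linear(dim, dim)   # appearance / identity
        self.temporal_adapter = nn.Linear(dim, dim)   # motion dynamics

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) noisy latents; t: (batch,) timesteps
        return self.temporal_adapter(self.identity_adapter(x)) + t.view(-1, 1, 1)

def shift_timesteps(t: torch.Tensor, s: float = 3.0) -> torch.Tensor:
    """Assumed shift formula: biases sampling toward high-noise steps,
    where coarse motion structure dominates the prediction target."""
    return s * t / (1.0 + (s - 1.0) * t)

model = ToyVideoDiT()
frames = torch.randn(2, 8, 64)  # a toy batch of two 8-frame clips

# Stage 1: learn appearance from temporally unordered frames,
# so the identity adapter cannot rely on motion cues.
opt1 = torch.optim.Adam(model.identity_adapter.parameters(), lr=1e-4)
shuffled = frames[:, torch.randperm(frames.shape[1])]  # destroy temporal order
loss = model(shuffled, torch.rand(2)).pow(2).mean()    # placeholder loss
loss.backward()
opt1.step()
model.zero_grad(set_to_none=True)

# Stage 2: freeze identity weights and train the temporal layers with
# timestep-shifted sampling to model dynamics.
for p in model.identity_adapter.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(model.temporal_adapter.parameters(), lr=1e-4)
loss = model(frames, shift_timesteps(torch.rand(2))).pow(2).mean()
loss.backward()
opt2.step()
```

Freezing the identity weights in stage 2 mirrors the disentanglement the abstract describes: appearance is fixed first, so the second stage can only explain the remaining variation through temporal dynamics.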