FairyGen: Storied Cartoon Video from a Single Child-Drawn Character
June 26, 2025
Authors: Jiayi Zheng, Xiaodong Cun
cs.AI
Abstract
We propose FairyGen, an automatic system for generating story-driven cartoon
videos from a single child's drawing, while faithfully preserving its unique
artistic style. Unlike previous storytelling methods that primarily focus on
character consistency and basic motion, FairyGen explicitly disentangles
character modeling from stylized background generation and incorporates
cinematic shot design to support expressive and coherent storytelling. Given a
single character sketch, we first employ an MLLM to generate a structured
storyboard with shot-level descriptions that specify environment settings,
character actions, and camera perspectives. To ensure visual consistency, we
introduce a style propagation adapter that captures the character's visual
style and applies it to the background, faithfully retaining the character's
full visual identity while synthesizing style-consistent scenes. A shot design
module further enhances visual diversity and cinematic quality through frame
cropping and multi-view synthesis based on the storyboard. To animate the
story, we reconstruct a 3D proxy of the character to derive physically
plausible motion sequences, which are then used to fine-tune an MMDiT-based
image-to-video diffusion model. We further propose a two-stage motion
customization adapter: the first stage learns appearance features from
temporally unordered frames, disentangling identity from motion; the second
stage models temporal dynamics using a timestep-shift strategy with frozen
identity weights. Once trained, FairyGen directly renders diverse and coherent
video scenes aligned with the storyboard. Extensive experiments demonstrate
that our system produces animations that are stylistically faithful,
narratively structured, and natural in motion, highlighting its potential for
personalized and engaging story animation. The code will be available at
https://github.com/GVCLab/FairyGen
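
To make the pipeline described above more concrete, the following sketch shows the kind of shot-level storyboard the abstract attributes to the MLLM stage, with each shot specifying an environment, a character action, and a camera perspective. The schema, field names, and example JSON are illustrative assumptions, not the exact format used by FairyGen.

```python
# Illustrative shot-level storyboard schema (assumed, not FairyGen's actual format).
import json
from dataclasses import dataclass

@dataclass
class Shot:
    environment: str   # stylized background / setting for this shot
    action: str        # what the character does
    camera: str        # e.g. "wide shot", "close-up", "low angle"

def parse_storyboard(mllm_json: str) -> list[Shot]:
    """Parse a structured (JSON) storyboard returned by an MLLM."""
    return [Shot(**shot) for shot in json.loads(mllm_json)]

example = """[
  {"environment": "sunny meadow with crayon-textured hills",
   "action": "the character waves and walks toward a small house",
   "camera": "wide establishing shot"},
  {"environment": "inside the small house, warm lamp light",
   "action": "the character opens a picture book",
   "camera": "close-up on the character's face"}
]"""
print(parse_storyboard(example))
```

The two-stage motion customization adapter can likewise be sketched as a generic latent-diffusion fine-tuning loop: stage one learns appearance on temporally shuffled frames so identity is disentangled from motion, and stage two freezes those weights and trains a temporal adapter with timestep-shifted sampling. All module names, the toy objective, and the shift formula below are placeholders under that assumption and do not reproduce the paper's MMDiT implementation.

```python
# Minimal sketch of the two-stage motion-customization idea (assumed training setup).
import torch
import torch.nn as nn

class IdentityAdapter(nn.Module):
    """Stage-1 adapter: learns appearance features from unordered frames."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.proj(x)

class MotionAdapter(nn.Module):
    """Stage-2 adapter: models temporal dynamics across ordered frames."""
    def __init__(self, dim=64):
        super().__init__()
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):                       # x: (batch, frames, dim)
        return x + self.temporal(x.transpose(1, 2)).transpose(1, 2)

def shifted_timesteps(batch, shift=3.0, device="cpu"):
    """Timestep-shift sampling: bias t toward the high-noise region,
    where motion (rather than appearance) dominates denoising."""
    u = torch.rand(batch, device=device)
    return (shift * u) / (1.0 + (shift - 1.0) * u)   # values in (0, 1)

def denoise_loss(latents, t, identity, motion=None):
    """Toy diffusion-style objective on placeholder latents."""
    noise = torch.randn_like(latents)
    noisy = (1 - t.view(-1, 1, 1)) * latents + t.view(-1, 1, 1) * noise
    feats = identity(noisy)
    if motion is not None:
        feats = motion(feats)
    return ((feats - latents) ** 2).mean()

identity, motion = IdentityAdapter(), MotionAdapter()
clip = torch.randn(2, 16, 64)                        # (batch, frames, dim)

# Stage 1: shuffle frames in time so only appearance/identity can be learned.
opt1 = torch.optim.AdamW(identity.parameters(), lr=1e-4)
shuffled = clip[:, torch.randperm(clip.shape[1])]
loss = denoise_loss(shuffled, torch.rand(2), identity)
loss.backward(); opt1.step()

# Stage 2: freeze identity weights, train the motion adapter on ordered
# frames with timestep-shifted sampling.
for p in identity.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.AdamW(motion.parameters(), lr=1e-4)
loss = denoise_loss(clip, shifted_timesteps(2), identity, motion)
loss.backward(); opt2.step()
```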