DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Abstract

While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which limit their expressiveness and robustness. We propose DreamActor-M1, a diffusion transformer (DiT) based framework with hybrid guidance, to overcome these limitations. For motion guidance, our hybrid control signals, which integrate implicit facial representations, 3D head spheres, and 3D body skeletons, achieve robust control of facial expressions and body movements while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms state-of-the-art works, delivering expressive results for portrait, upper-body, and full-body generation with robust long-term consistency. Project Page: https://grisoon.github.io/DreamActor-M1/