RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild

Abstract

Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but such methods often struggle to generalize to open-world scenarios. In this paper, we propose a new perspective: as long as the foundation model is powerful enough, straightforward model modifications combined with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our thorough analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as the TikTok dataset and the UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.
