MoZoo: Unleashing the Power of Video Diffusion for Animal Fur and Muscle Simulation

Abstract

Cinematic-quality animal effects require precise modeling of muscle and fur dynamics, a process that remains labor-intensive and computationally expensive in traditional production workflows. Although generative diffusion models have shown promise across diverse artistic workflows, their capacity for high-fidelity animal simulation remains largely unexplored. We present MoZoo, a generative dynamics solver that bypasses conventional refinement and synthesizes high-fidelity animal videos from coarse meshes under multimodal guidance. We propose Role-Aware RoPE (RAR-RoPE), which uses role-based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets. Complementing this, Asymmetric Decoupled Attention partitions the latent sequence to enforce a unidirectional information flow, preventing feature interference and improving computational efficiency. To address the scarcity of high-quality training data, we introduce MoZoo-Data, a synthetic-to-real pipeline that leverages a rendering engine and an inverse mapping approach to construct a large-scale dataset of paired sequences. Furthermore, we establish MoZooBench, a comprehensive benchmark of 120 mesh-video pairs. Experimental results show that MoZoo achieves high-fidelity fur simulation across diverse animal skeletons and layouts while maintaining strong temporal and structural consistency.
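
The two architectural ideas named in the abstract can be illustrated with a small sketch. It is not the MoZoo implementation, and every name in it is an assumption made for illustration: the role labels `mesh`, `target`, and `reference`, the offset value `REF_OFFSET`, and the helper functions `temporal_indices` and `attention_mask`. The sketch assumes that role-based index remapping gives a mesh token and a target token of the same frame the same temporal index, that reference tokens are shifted by a fixed offset, and that unidirectional flow means target tokens may read condition tokens but not the reverse.

```python
from typing import List, Tuple

# Hypothetical fixed temporal offset for reference tokens (value is illustrative only).
REF_OFFSET = 1000

Token = Tuple[str, int]  # (role, frame index)


def temporal_indices(tokens: List[Token], ref_offset: int = REF_OFFSET) -> List[int]:
    """Return the temporal position index that would be fed to RoPE for each token.

    Assumption: mesh and target tokens of the same frame share one index, so
    their rotary phases line up; reference tokens are pushed away by a fixed
    offset so they do not collide with any motion frame.
    """
    indices = []
    for role, frame in tokens:
        if role in ("mesh", "target"):
            indices.append(frame)
        elif role == "reference":
            indices.append(ref_offset + frame)
        else:
            raise ValueError(f"unknown role: {role}")
    return indices


def attention_mask(tokens: List[Token]) -> List[List[bool]]:
    """Return mask[q][k], True where query token q may attend to key token k.

    Assumption: the sequence is partitioned into condition tokens (mesh,
    reference) and target tokens. Target queries see every key; condition
    queries see only condition keys, so information flows one way.
    """
    is_target = [role == "target" for role, _ in tokens]
    n = len(tokens)
    return [[is_target[q] or not is_target[k] for k in range(n)] for q in range(n)]


tokens = [("reference", 0), ("mesh", 0), ("mesh", 1), ("target", 0), ("target", 1)]
positions = temporal_indices(tokens)
mask = attention_mask(tokens)
```

With these five tokens, `positions` is `[1000, 0, 1, 0, 1]`: each mesh frame shares an index with the target frame it drives, and the reference sits far from both. In `mask`, the three condition rows are blocked from the two target columns while the target rows are fully open. Under this assumption the condition tokens never depend on the target tokens, which is one plausible source of the efficiency gain the abstract mentions.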