AnyMo: Scaling Any-Modality Conditioned Motion Generation with Masked Modeling

Abstract

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant recent progress, current methods are often constrained by fixed modality configurations and task-specific architectures, so cross-modal interactions and the scaling laws of multimodal-conditioned synthesis remain largely unexplored. A key bottleneck is the scarcity of large-scale, modality-aligned motion data, which limits generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences, with precisely aligned multimodal annotations such as text, speech, music, and trajectory. Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework that combines a motion tokenizer based on residual finite scalar quantization (Residual FSQ) with a scalable masked modeling transformer. AnyMo enables high-quality motion synthesis under arbitrary combinations of modalities. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
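
To make the idea behind a Residual FSQ-based tokenizer concrete, the following is a minimal NumPy sketch of residual finite scalar quantization in general, not AnyMo's actual tokenizer. Everything here is an assumption made for illustration: the function names `fsq_round` and `residual_fsq`, the choice of 5 levels and 3 stages, the tanh bounding, and the random latents standing in for encoder outputs. The sketch omits the motion encoder and decoder and any training logic (for example, straight-through gradients).

```python
import numpy as np

def fsq_round(x, levels):
    """Snap values in [-1, 1] onto `levels` evenly spaced grid points (levels must be odd)."""
    half = (levels - 1) // 2
    # Integer codes in [-half, half]; the clip guards against floating-point overshoot.
    codes = np.clip(np.round(x * half), -half, half).astype(np.int64)
    return codes, codes / half

def residual_fsq(z, levels=5, num_stages=3):
    """Quantize tanh(z) with several FSQ passes, each refining the previous residual.

    Hypothetical illustration only; not the tokenizer described in the paper.
    """
    target = np.tanh(z)              # bound the latents to (-1, 1)
    residual = target.copy()
    recon = np.zeros_like(target)
    all_codes = []
    scale = 1.0
    for _ in range(num_stages):
        # Rescale the residual back to [-1, 1] before quantizing it.
        codes, quantized = fsq_round(residual / scale, levels)
        recon += quantized * scale
        residual = target - recon
        all_codes.append(codes)
        # After each stage the residual shrinks by a factor of (levels - 1).
        scale /= (levels - 1)
    return np.stack(all_codes), recon

# Random latents standing in for encoder outputs (4 frames, 8 channels).
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
codes, recon = residual_fsq(z)
max_err = np.abs(np.tanh(z) - recon).max()
```

In this sketch each stage emits one integer code per latent channel, and the reconstruction error of the bounded latent is at most `(levels - 1) ** -num_stages`, so adding stages trades a longer token sequence for finer reconstruction.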