AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
May 28, 2026
Authors: Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan
cs.AI
Abstract
Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.