AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Abstract

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale, modality-aligned motion data, which limits generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion data and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Building on OmniHuMo, we propose AnyMo, a unified multimodal framework that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.