AnyMo: Scaling Any-Modality Conditioned Motion Generation with Masked Modeling

Abstract

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, so cross-modal interactions and the scaling laws of multimodal-conditioned synthesis remain largely unexplored. A key bottleneck is the scarcity of large-scale, modality-aligned motion data, which limits generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Building on OmniHuMo, we propose AnyMo, a unified multimodal framework that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary combinations of modalities. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.