AnyMo: Scaling Any-Modality Conditioned Motion Generation via Masked Modeling

Abstract

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
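
The abstract names a Residual FSQ-based motion tokenizer but does not spell out its details. As a rough illustration of the general idea only (not the authors' implementation), the sketch below applies residual finite scalar quantization to continuous latent vectors: each stage rounds every channel to a small fixed grid, and the next stage quantizes whatever error is left at a finer scale. Everything here is an assumption made for illustration: the function names `fsq_round` and `residual_fsq`, the odd per-channel level counts, clipping to [-1, 1] in place of a learned bounding function, and the per-stage scale of (levels - 1)^(-stage).

```python
import numpy as np


def fsq_round(x, levels):
    """Round each channel of x (assumed to lie in [-1, 1]) to an evenly spaced grid.

    `levels` holds the number of grid points per channel; odd counts are
    assumed so that zero is always a grid point.
    Returns the quantized values and integer codes in [0, levels - 1].
    """
    half = (np.asarray(levels, dtype=np.float64) - 1.0) / 2.0
    idx = np.round(np.clip(x, -1.0, 1.0) * half)  # integers in [-half, half]
    return idx / half, (idx + half).astype(np.int64)


def residual_fsq(z, levels, num_stages=3):
    """Hypothetical residual FSQ: quantize z, then repeatedly quantize the leftover error.

    Stage s works at scale (levels - 1) ** -s, which is exactly the worst-case
    error left by stage s - 1, so the rescaled residual stays inside [-1, 1].
    Returns the reconstruction and one integer code array per stage.
    """
    levels = np.asarray(levels, dtype=np.float64)
    residual = np.asarray(z, dtype=np.float64)
    recon = np.zeros_like(residual)
    codes = []
    for stage in range(num_stages):
        scale = (levels - 1.0) ** (-stage)
        q, idx = fsq_round(residual / scale, levels)
        q = q * scale
        codes.append(idx)
        recon = recon + q
        residual = residual - q
    return recon, codes
```

Under these assumptions, each added stage shrinks the worst-case per-channel reconstruction error by a factor of (levels - 1), while the per-stage integer codes are the discrete tokens that a masked modeling transformer would be trained to predict.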