
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

October 30, 2025
作者: Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu
cs.AI

Abstract

Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
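The abstract names a flow-matching diffusion transformer with gated multimodal conditioning but gives no implementation details. A minimal sketch of those two ideas, assuming a linear-interpolant flow-matching objective and a scalar sigmoid gate over text/video condition embeddings; every function name, shape, and the gating form here is an illustrative assumption, not the authors' code:

```python
import math
import random

random.seed(0)

def gated_condition(text_emb, video_emb, gate_w):
    # Hypothetical gated fusion (illustrative, not the paper's mechanism
    # verbatim): a scalar sigmoid gate interpolates between the text and
    # video condition vectors.
    z = sum(w * x for w, x in zip(gate_w, text_emb + video_emb))
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * v + (1.0 - g) * t for t, v in zip(text_emb, video_emb)]

def flow_matching_target(x0, x1, t):
    # Linear-interpolant flow matching: the point on the noise->data path
    # at time t, and the constant velocity the model regresses.
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return x_t, v

d = 8  # toy dimensionality for a flattened motion clip
x0 = [random.gauss(0, 1) for _ in range(d)]   # Gaussian noise sample
x1 = [random.gauss(0, 1) for _ in range(d)]   # "data" motion sample
t = random.random()
x_t, v = flow_matching_target(x0, x1, t)

text_emb = [random.gauss(0, 1) for _ in range(d)]
video_emb = [random.gauss(0, 1) for _ in range(d)]
gate_w = [random.gauss(0, 1) for _ in range(2 * d)]
cond = gated_condition(text_emb, video_emb, gate_w)
```

During training, the transformer would regress `v` from `(x_t, t, cond)`; a distilled variant like ViMoGen-light, which drops the video-generation dependency, would correspond here to conditioning on the text embedding alone.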
PDF · December 2, 2025