

MoMask: Generative Masked Modeling of 3D Human Motions

November 29, 2023
作者: Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, Li Cheng
cs.AI

Abstract
We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting from the base layer, which holds a sequence of motion tokens obtained by vector quantization, residual tokens of increasing order are derived and stored at the subsequent layers of the hierarchy. This is followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is designated to predict randomly masked motion tokens conditioned on the text input at the training stage. During the generation (i.e., inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills in the missing tokens; subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from the current layer. Extensive experiments demonstrate that MoMask outperforms state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 (vs. 0.141 for T2M-GPT) on the HumanML3D dataset and 0.228 (vs. 0.514) on KIT-ML. MoMask can also be applied seamlessly to related tasks without further model fine-tuning, such as text-guided temporal inpainting.
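The hierarchical quantization the abstract describes can be illustrated with a minimal residual vector quantization (RVQ) sketch: the base layer quantizes each motion latent against a codebook, and every subsequent layer quantizes whatever residual the previous layers left behind; decoding sums the selected entries across layers. This is a generic RVQ illustration under assumed shapes (frames × feature dim), not MoMask's actual trained codebooks; the function names `rvq_encode`/`rvq_decode` are hypothetical.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization (illustrative sketch).

    x: (T, D) sequence of motion latents; codebooks: list of (K, D) arrays,
    one per layer. Returns one token sequence per layer.
    """
    residual = x
    token_layers = []
    for cb in codebooks:
        # nearest codebook entry per frame (squared L2 distance)
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        tokens = d.argmin(axis=1)
        token_layers.append(tokens)
        # the next layer quantizes what this layer failed to capture
        residual = residual - cb[tokens]
    return token_layers

def rvq_decode(token_layers, codebooks):
    # reconstruction = sum of the selected entries across all layers
    return sum(cb[t] for t, cb in zip(token_layers, codebooks))
```

Using more layers can only refine the reconstruction, which is why the paper can keep high-fidelity detail in the deeper residual layers while the base layer carries the coarse motion.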