MoMask: Generative Masked Modeling of 3D Human Motions
November 29, 2023
Authors: Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, Li Cheng
cs.AI
Abstract
We introduce MoMask, a novel masked modeling framework for text-driven 3D
human motion generation. In MoMask, a hierarchical quantization scheme is
employed to represent human motion as multi-layer discrete motion tokens with
high-fidelity details. Starting at the base layer, with a sequence of motion
tokens obtained by vector quantization, residual tokens of increasing orders
are derived and stored at the subsequent layers of the hierarchy. This
hierarchy is then modeled by two distinct bidirectional transformers. For the
base-layer motion tokens, a Masked Transformer is trained to predict randomly
masked motion tokens conditioned on the text input. At the generation (i.e.,
inference) stage, starting from an empty sequence, our Masked Transformer
iteratively fills in the missing tokens; subsequently, a Residual Transformer
progressively predicts the next-layer tokens based on the results from the
current layer. Extensive experiments demonstrate that MoMask outperforms
state-of-the-art methods on the text-to-motion generation task, with an FID of
0.045 (vs. 0.141 for T2M-GPT) on the HumanML3D dataset and 0.228 (vs. 0.514)
on KIT-ML. MoMask can also be seamlessly applied to related tasks, such as
text-guided temporal inpainting, without further model fine-tuning.
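
The hierarchical quantization described in the abstract is a residual scheme: the base layer vector-quantizes the motion latents, and each subsequent layer quantizes the error left over from the layers before it. Below is a minimal NumPy sketch of this idea, assuming per-frame latents and fixed random codebooks purely for illustration; in the actual model the codebooks are learned jointly with a motion autoencoder, and the names `quantize` and `residual_vq` are hypothetical, not the paper's API.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  (T, D) array of per-frame motion latents.
    codebook: (K, D) array of code vectors.
    Returns (token ids of shape (T,), quantized vectors of shape (T, D)).
    """
    # Pairwise squared distances between latents and codes.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]

def residual_vq(latents, codebooks):
    """Hierarchical (residual) quantization: the base layer quantizes the
    latents; each subsequent layer quantizes what the previous layers missed."""
    residual = latents
    token_layers = []
    for codebook in codebooks:
        ids, quantized = quantize(residual, codebook)
        token_layers.append(ids)
        residual = residual - quantized  # pass the remaining error downward
    return token_layers  # one (T,) id sequence per layer

# Toy usage: 16 frames, 8-dim latents, a base layer plus 3 residual layers.
rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 8))
codebooks = [rng.normal(size=(512, 8)) for _ in range(4)]
layers = residual_vq(latents, codebooks)
```

Summing the quantized vectors across layers reconstructs the latents with an error that shrinks as layers are added, which is what lets the token hierarchy carry high-fidelity detail.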
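
The Masked Transformer's iterative fill-in procedure can be sketched as confidence-based remasking in the style of MaskGIT: all positions start masked, the model predicts every masked token in parallel, the most confident predictions are committed, and the rest are re-masked under a shrinking schedule. The sketch below uses a dummy predictor in place of the text-conditioned transformer; the cosine schedule and the names `masked_decode`/`dummy_predict` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def masked_decode(predict_fn, seq_len, num_iters=10):
    """Iteratively fill a fully masked token sequence.

    predict_fn(tokens) -> (seq_len, vocab_size) array of token probabilities.
    Each iteration keeps the most confident predictions and re-masks the rest,
    with the masked fraction shrinking to zero under a cosine schedule.
    """
    tokens = np.full(seq_len, MASK, dtype=int)
    for step in range(1, num_iters + 1):
        probs = predict_fn(tokens)          # conditioned on text in the real model
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        fixed = tokens != MASK
        conf[fixed] = np.inf                # committed tokens are never re-masked
        new_tokens = pred.copy()
        new_tokens[fixed] = tokens[fixed]   # keep tokens committed earlier
        # Cosine schedule: fraction of positions still masked after this step.
        mask_ratio = np.cos(0.5 * np.pi * step / num_iters)
        num_masked = int(np.floor(seq_len * mask_ratio))
        if num_masked > 0:
            remask = np.argsort(conf)[:num_masked]  # lowest-confidence positions
            new_tokens[remask] = MASK
        tokens = new_tokens
    return tokens

# Toy usage with a dummy "model" that returns random probabilities.
def dummy_predict(tokens, vocab_size=512, rng=np.random.default_rng(1)):
    logits = rng.normal(size=(len(tokens), vocab_size))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

out = masked_decode(dummy_predict, seq_len=16)
```

In the full pipeline, the base-layer tokens produced this way would then be handed to the Residual Transformer, which predicts the tokens of each subsequent layer from the layers already generated.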