MoMask: Generative Masked Modeling of 3D Human Motions
November 29, 2023
Authors: Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, Li Cheng
cs.AI
Abstract
We introduce MoMask, a novel masked modeling framework for text-driven 3D
human motion generation. In MoMask, a hierarchical quantization scheme is
employed to represent human motion as multi-layer discrete motion tokens with
high-fidelity details. Starting at the base layer, with a sequence of motion
tokens obtained by vector quantization, residual tokens of increasing orders
are derived and stored at the subsequent layers of the hierarchy. This
hierarchy is then modeled by two distinct bidirectional transformers. For the
base-layer motion tokens, a Masked Transformer is trained to predict randomly
masked motion tokens conditioned on the text input. At the generation (i.e.,
inference) stage, starting from an empty sequence, our Masked Transformer
iteratively fills in the missing tokens; subsequently, a Residual Transformer
progressively predicts the next-layer tokens based on the results from the
current layer. Extensive experiments demonstrate that MoMask outperforms
state-of-the-art methods on the text-to-motion generation task, with an FID of
0.045 (vs. 0.141 for T2M-GPT) on the HumanML3D dataset and 0.228 (vs. 0.514)
on KIT-ML. MoMask can also be seamlessly applied to related tasks, such as
text-guided temporal inpainting, without further model fine-tuning.
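
The hierarchical quantization described in the abstract is a residual scheme: the base layer vector-quantizes the motion latents, and each subsequent layer quantizes the error left over from the layers before it. Below is a minimal NumPy sketch of this idea, assuming per-frame latents and fixed random codebooks purely for illustration; in the actual model the codebooks are learned jointly with a motion autoencoder, and the names `quantize` and `residual_vq` are hypothetical, not the paper's API.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  (T, D) array of per-frame motion latents.
    codebook: (K, D) array of code vectors.
    Returns (token ids of shape (T,), quantized vectors of shape (T, D)).
    """
    # Pairwise squared distances between latents and codes.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]

def residual_vq(latents, codebooks):
    """Hierarchical (residual) quantization: the base layer quantizes the
    latents; each subsequent layer quantizes what the previous layers missed."""
    residual = latents
    token_layers = []
    for codebook in codebooks:
        ids, quantized = quantize(residual, codebook)
        token_layers.append(ids)
        residual = residual - quantized  # pass the remaining error downward
    return token_layers  # one (T,) id sequence per layer

# Toy usage: 16 frames, 8-dim latents, a base layer plus 3 residual layers.
rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 8))
codebooks = [rng.normal(size=(512, 8)) for _ in range(4)]
layers = residual_vq(latents, codebooks)
```

Summing the quantized vectors across layers reconstructs the latents with an error that shrinks as layers are added, which is what lets the token hierarchy carry high-fidelity detail.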
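
The Masked Transformer's iterative fill-in procedure can be sketched as confidence-based remasking in the style of MaskGIT: all positions start masked, the model predicts every masked token in parallel, the most confident predictions are committed, and the rest are re-masked under a shrinking schedule. The sketch below uses a dummy predictor in place of the text-conditioned transformer; the cosine schedule and the names `masked_decode`/`dummy_predict` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def masked_decode(predict_fn, seq_len, num_iters=10):
    """Iteratively fill a fully masked token sequence.

    predict_fn(tokens) -> (seq_len, vocab_size) array of token probabilities.
    Each iteration keeps the most confident predictions and re-masks the rest,
    with the masked fraction shrinking to zero under a cosine schedule.
    """
    tokens = np.full(seq_len, MASK, dtype=int)
    for step in range(1, num_iters + 1):
        probs = predict_fn(tokens)          # conditioned on text in the real model
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        fixed = tokens != MASK
        conf[fixed] = np.inf                # committed tokens are never re-masked
        new_tokens = pred.copy()
        new_tokens[fixed] = tokens[fixed]   # keep tokens committed earlier
        # Cosine schedule: fraction of positions still masked after this step.
        mask_ratio = np.cos(0.5 * np.pi * step / num_iters)
        num_masked = int(np.floor(seq_len * mask_ratio))
        if num_masked > 0:
            remask = np.argsort(conf)[:num_masked]  # lowest-confidence positions
            new_tokens[remask] = MASK
        tokens = new_tokens
    return tokens

# Toy usage with a dummy "model" that returns random probabilities.
def dummy_predict(tokens, vocab_size=512, rng=np.random.default_rng(1)):
    logits = rng.normal(size=(len(tokens), vocab_size))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

out = masked_decode(dummy_predict, seq_len=16)
```

In the full pipeline, the base-layer tokens produced this way would then be handed to the Residual Transformer, which predicts the tokens of each subsequent layer from the layers already generated.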