MoMask: Generative Masked Modeling of 3D Human Motions

Abstract

We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. MoMask employs a hierarchical quantization scheme to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, residual tokens of increasing order are derived and stored at the subsequent layers of the hierarchy. This is followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is trained to predict randomly masked motion tokens conditioned on the text input. During generation (i.e., inference), starting from an empty sequence, the Masked Transformer iteratively fills in the missing tokens. Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from the current layer. Extensive experiments demonstrate that MoMask outperforms state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 on the HumanML3D dataset (vs. 0.141 for T2M-GPT, for example) and 0.228 on KIT-ML (vs. 0.514). MoMask can also be applied seamlessly to related tasks, such as text-guided temporal inpainting, without further model fine-tuning.
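The hierarchical quantization idea can be illustrated with a minimal NumPy sketch of generic residual vector quantization. This is not the authors' implementation: the function name `residual_quantize`, the array shapes, and the random codebooks are all assumptions made for illustration, and a real system would learn the codebooks from motion data.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Quantize each row of x with a stack of codebooks, layer by layer.

    x: (T, D) array of continuous features (hypothetical shape).
    codebooks: list of (K, D) arrays, one per quantization layer.
    Returns (tokens, recon), where tokens has shape (num_layers, T).
    """
    residual = x.astype(np.float64)
    recon = np.zeros_like(residual)
    tokens = []
    for codebook in codebooks:
        # Squared distance from every residual vector to every code.
        diffs = residual[:, None, :] - codebook[None, :, :]
        dists = (diffs ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)      # nearest code per time step
        chosen = codebook[idx]
        tokens.append(idx)
        recon += chosen                 # accumulate the approximation
        residual -= chosen              # the next layer encodes what is left
    return np.stack(tokens), recon


# Illustrative data only: random features and random (untrained) codebooks.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
codebooks = [rng.normal(scale=1.0 / (2 ** i), size=(16, 4)) for i in range(3)]
tokens, recon = residual_quantize(x, codebooks)
```

Row 0 of `tokens` plays the role of the base-layer sequence, and each later row holds the residual tokens of the next order. Summing the selected codes across layers gives the reconstruction.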
