MotionGPT: Human Motion as a Foreign Language

Abstract

Although pre-trained large language models continue to advance, building a unified model for language and other multi-modal data, such as motion, remains challenging and largely unexplored. Fortunately, human motion displays a semantic coupling akin to human language and is often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model that handles multiple motion-relevant tasks. Specifically, we employ discrete vector quantization for human motion and convert 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performance on multiple motion tasks, including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
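The idea of a "motion vocabulary" can be illustrated with a minimal sketch. It is not the MotionGPT implementation: it assumes a tiny hand-written codebook in a 2-D feature space, skips the learned encoder and decoder that a real quantizer would use, and invents the `<motion_k>` token naming and the toy text vocabulary purely for illustration. It shows only the nearest-neighbor quantization step and how the resulting motion tokens could share one id space with word tokens.

```python
import math

def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector (L2 distance)."""
    return [
        min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))
        for frame in frames
    ]

def to_motion_tokens(indices):
    # Hypothetical token format; the abstract does not specify one.
    return [f"<motion_{k}>" for k in indices]

# Toy codebook with 4 entries (assumption: real motion features are
# higher-dimensional and the codebook is learned, not hand-written).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Four made-up per-frame motion features.
frames = [(0.1, -0.1), (0.9, 0.2), (0.8, 0.9), (0.2, 1.1)]

indices = quantize(frames, codebook)
tokens = to_motion_tokens(indices)

# Unified vocabulary: word tokens and motion tokens share one id space,
# so a single language model can read and emit both.
text_vocab = ["walk", "forward", "a", "person"]
motion_vocab = [f"<motion_{k}>" for k in range(len(codebook))]
vocab = {tok: i for i, tok in enumerate(text_vocab + motion_vocab)}

# A mixed text + motion sequence expressed as ids from the shared vocabulary.
sequence = ["a", "person", "walk", "forward"] + tokens
ids = [vocab[tok] for tok in sequence]

print(tokens)
print(ids)
```

Once motion is expressed as ids in the same vocabulary as text, tasks such as text-to-motion generation or motion captioning can both be framed as sequence-to-sequence prediction over one token stream.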
