MotionGPT: Human Motion as a Foreign Language

Abstract

Although pre-trained large language models continue to advance, building a unified model for language and other multi-modal data, such as motion, remains challenging and largely unexplored. Fortunately, human motion displays a semantic coupling akin to human language and is often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model that handles multiple motion-relevant tasks. Specifically, we apply discrete vector quantization to human motion and convert 3D motion into motion tokens, similar to the way word tokens are generated. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performance on multiple motion tasks, including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
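The idea of turning motion into tokens can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a tiny hand-written codebook, treats 2-D vectors as stand-ins for learned motion features, and uses a hypothetical `<motion_k>` token naming scheme. Each feature vector is mapped to the index of its nearest codebook entry, and the resulting indices are written out as tokens that can sit alongside ordinary text in a prompt.

```python
from typing import List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each feature vector, the index of the nearest codebook entry.

    Nearest is measured by squared Euclidean distance. In a real system the
    features would come from a learned motion encoder; here they are toy values.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def to_motion_tokens(indices: Sequence[int]) -> str:
    """Render code indices as text tokens (the token format is an assumption)."""
    return "".join(f"<motion_{i}>" for i in indices)


# Toy "motion vocabulary" with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Four toy feature vectors standing in for an encoded motion clip.
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.95, 0.05]]

indices = quantize(features, codebook)
tokens = to_motion_tokens(indices)

# Motion tokens and words now share one sequence, as in a prompt-based task.
prompt = f"Describe the motion: {tokens}"
print(indices)
print(prompt)
```

Once motion is expressed as discrete indices, the same sequence model and vocabulary-lookup machinery used for words can, in principle, be applied to motion and text together, which is the sense in which the abstract treats motion as a "specific language".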
