MotionGPT: Finetuned LLMs are General-Purpose Motion Generators
June 19, 2023
Authors: Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, Wanli Ouyang
cs.AI
Abstract
Generating realistic human motion from given action descriptions has advanced
significantly, driven by the emerging demand for digital humans. While recent
works have achieved impressive results in generating motion directly from
textual action descriptions, they often support only a single modality of
control signal, which limits their application in the real digital human
industry. This paper presents a Motion General-Purpose
generaTor (MotionGPT) that can use multimodal control signals, e.g., text and
single-frame poses, for generating consecutive human motions by treating
multimodal signals as special input tokens in large language models (LLMs).
Specifically, we first quantize multimodal control signals into discrete codes
and then formulate them in a unified prompt instruction to ask the LLMs to
generate the motion answer. Our MotionGPT demonstrates a unified human motion
generation model with multimodal control signals by tuning a mere 0.4% of LLM
parameters. To the best of our knowledge, MotionGPT is the first method to
generate human motion from multimodal control signals, and we hope it can shed
light on this new direction. Code will be released upon acceptance.
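The pipeline the abstract describes (quantize multimodal control signals into discrete codes, then formulate them as a unified prompt instruction for the LLM) can be sketched as follows. This is a minimal illustration, not MotionGPT's actual implementation: the codebook here is random rather than learned, and the `<pose_i>` token format and prompt wording are assumptions for illustration.

```python
import numpy as np

def quantize(signal, codebook):
    """VQ step: map each frame vector to the index of its nearest
    codebook entry, turning a continuous control signal into discrete codes."""
    # signal: (T, D) frames; codebook: (K, D) entries (learned in practice,
    # random here for illustration)
    dists = np.linalg.norm(signal[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # (T,) discrete code indices

def build_prompt(text, pose_codes):
    """Formulate the text description and quantized pose codes as one
    unified instruction. The <pose_i> token format is hypothetical."""
    pose_tokens = " ".join(f"<pose_{c}>" for c in pose_codes)
    return (f"Instruction: generate a motion matching the description "
            f"\"{text}\", starting from pose {pose_tokens}.")

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # hypothetical 512-entry codebook
pose = rng.normal(size=(1, 64))         # one single-frame pose control signal
codes = quantize(pose, codebook)
prompt = build_prompt("a person waves", codes)
```

In the paper's setting, the LLM answering this prompt is adapted with a lightweight tuning method touching only about 0.4% of its parameters, so the base model stays frozen while it learns to emit motion tokens.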