MotionGPT: Finetuned LLMs are General-Purpose Motion Generators
June 19, 2023
Authors: Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, Wanli Ouyang
cs.AI
Abstract
Generating realistic human motion from given action descriptions has advanced
significantly, driven by the emerging demands of digital humans. While recent
works achieve impressive results in generating motion directly from textual
action descriptions, they typically support only a single modality of control
signal, which limits their application in the real digital human industry. This
paper presents the Motion General-Purpose generaTor (MotionGPT), which can use
multimodal control signals, e.g., text and single-frame poses, to generate
consecutive human motions by treating the multimodal signals as special input
tokens in large language models (LLMs). Specifically, we first quantize the
multimodal control signals into discrete codes and then formulate them into a
unified prompt instruction that asks the LLM to generate the motion answer.
Our MotionGPT demonstrates a unified human-motion generation model driven by
multimodal control signals while tuning a mere 0.4% of the LLM's parameters.
To the best of our knowledge, MotionGPT is the first method to generate human
motion from multimodal control signals, and we hope it can shed light on this
new direction. Code will be released upon acceptance.
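The pipeline described above — quantizing a control signal into discrete codes and wrapping those codes, together with the text description, into a unified prompt — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `quantize` nearest-neighbor lookup stands in for the learned VQ-style quantizer, and the `<motion_i>` token format and prompt wording in `build_prompt` are assumptions for illustration only.

```python
import numpy as np

def quantize(signal, codebook):
    """Map each frame of a control signal to the index of its nearest
    codebook entry (a stand-in for the learned quantization step)."""
    # signal: (T, D) frames; codebook: (K, D) code vectors
    dists = np.linalg.norm(signal[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # (T,) discrete code indices

def build_prompt(text, pose_codes):
    """Formulate the text description and quantized pose codes into a single
    instruction prompt for the LLM (token names are hypothetical)."""
    pose_tokens = " ".join(f"<motion_{i}>" for i in pose_codes)
    return (
        "Instruction: generate a human motion sequence.\n"
        f"Text: {text}\n"
        f"Pose: {pose_tokens}\n"
        "Answer:"
    )

# Toy example: K=512 codes of dimension D=64, one single-frame pose.
codebook = np.random.default_rng(0).normal(size=(512, 64))
pose = np.zeros((1, 64))
codes = quantize(pose, codebook)
print(build_prompt("a person waves with the right hand", codes))
```

The LLM's answer would itself be a sequence of discrete motion tokens, decoded back into poses by the quantizer's decoder; the 0.4% parameter figure corresponds to tuning only a small adapter on top of the frozen LLM.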