MotionGPT: Finetuned LLMs are General-Purpose Motion Generators

Abstract

Generating realistic human motion from given action descriptions has advanced significantly, driven by the growing demand for digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, to generate consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them as a unified prompt instruction that asks the LLM to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion from multimodal control signals, which we hope can shed light on this new direction. Code will be released upon acceptance.
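The two steps named in the abstract, quantizing a control signal into discrete codes and placing those codes in a unified prompt, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy 2-D codebook, the nearest-neighbor lookup, the `<motion_k>` token format, and the prompt template are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical toy codebook: each entry stands in for a learned pose embedding.
TOY_CODEBOOK: List[List[float]] = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame vector to the index of its nearest codebook entry
    (squared Euclidean distance)."""
    codes = []
    for frame in frames:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, entry))
                 for entry in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def codes_to_tokens(codes: Sequence[int]) -> str:
    """Render discrete codes as special tokens (token format is an assumption)."""
    return " ".join(f"<motion_{c}>" for c in codes)


def build_prompt(text: str, pose_codes: Sequence[int]) -> str:
    """Combine text and pose-code conditions into one instruction string.
    The template wording is hypothetical, not the paper's."""
    parts = ["Instruction: Generate a sequence of motion tokens "
             "matching the control conditions."]
    if text:
        parts.append(f"Text description: {text}")
    if pose_codes:
        parts.append(f"Initial pose: {codes_to_tokens(pose_codes)}")
    parts.append("Response:")
    return "\n".join(parts)


# A single-frame pose (toy 2-D vector) plus a text description.
pose_codes = quantize([[0.9, 1.2]], TOY_CODEBOOK)
prompt = build_prompt("a person walks forward", pose_codes)
```

In this sketch, either condition can be omitted, so the same template covers text-only, pose-only, and combined control, which is the sense in which the prompt is "unified".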
