Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions

Abstract



Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse, high-quality human motions that conform to textual descriptions. However, generating fine-grained or stylized motions remains challenging because few datasets are annotated with detailed textual descriptions. Adopting a divide-and-conquer strategy, we propose a new framework for human motion generation, the Fine-Grained Human Motion Diffusion Model (FG-MDM). Specifically, we first leverage a large language model (GPT-3.5) to parse the existing, vague textual annotations into fine-grained descriptions of different body parts. We then use these fine-grained descriptions to guide a transformer-based diffusion model. FG-MDM can generate fine-grained and stylized motions even outside the distribution of the training data. Our experimental results demonstrate the superiority of FG-MDM over previous methods, especially its strong generalization capability. We will release our fine-grained textual annotations for HumanML3D and KIT.
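The "divide" step described above (splitting one vague caption into per-body-part descriptions) can be illustrated with a minimal Python sketch. It assumes a hypothetical set of five body parts and a hypothetical prompt wording, because the abstract specifies neither. The actual GPT-3.5 call is omitted, so the reply in the demo is a hand-written stand-in, not real model output.

```python
# Hypothetical body-part list: the abstract does not enumerate the parts FG-MDM uses.
BODY_PARTS = ["torso", "left arm", "right arm", "left leg", "right leg"]


def build_prompt(coarse_text: str) -> str:
    """Compose an illustrative (hypothetical) prompt asking an LLM to split a
    vague motion caption into one short description per body part."""
    lines = [
        "Describe the following human motion in detail, one line per body part.",
        f"Motion: {coarse_text}",
        "Answer using exactly these labels:",
    ]
    lines += [f"{part}: <description>" for part in BODY_PARTS]
    return "\n".join(lines)


def parse_reply(reply: str) -> dict[str, str]:
    """Parse 'part: description' lines from an LLM reply.

    Labels are matched case-insensitively; lines without a known label are
    ignored, and any body part the model omitted maps to an empty string.
    """
    parsed = {part: "" for part in BODY_PARTS}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in parsed:
            parsed[key] = value.strip()
    return parsed


if __name__ == "__main__":
    print(build_prompt("a person walks forward"))
    # Hand-written stand-in for a model reply (not real GPT-3.5 output).
    reply = (
        "Torso: leans slightly forward.\n"
        "Left arm: swings back and forth.\n"
        "Right arm: swings opposite to the left arm.\n"
        "Left leg: steps forward repeatedly.\n"
        "Right leg: steps forward repeatedly."
    )
    parts = parse_reply(reply)
    print(parts["left arm"])  # swings back and forth.
```

Returning a fixed set of body-part keys, with empty strings for missing parts, is one simple way to give a downstream diffusion model a consistent conditioning format. How FG-MDM actually encodes these descriptions is not described in the abstract.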
