Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

May 16, 2023
Authors: Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, Sonal Gupta
cs.AI

Abstract

Text-guided human motion generation has drawn significant interest because of its impactful applications spanning animation and robotics. Recently, application of diffusion models for motion generation has enabled improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts. In this paper, we introduce Make-An-Animation, a text-conditioned human motion generation model which learns more diverse poses and prompts from large-scale image-text datasets, enabling significant improvement in performance over prior works. Make-An-Animation is trained in two stages. First, we train on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. Second, we fine-tune on motion capture data, adding additional layers to model the temporal dimension. Unlike prior diffusion models for motion generation, Make-An-Animation uses a U-Net architecture similar to recent text-to-video generation models. Human evaluation of motion realism and alignment with input text shows that our model reaches state-of-the-art performance on text-to-motion generation.
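The abstract describes a two-stage recipe: a diffusion U-Net is first trained on (text, static pseudo-pose) pairs, and temporal layers are then added before fine-tuning on motion capture data, in the spirit of recent text-to-video models. The paper's exact layer design is not given here, so the snippet below is only a minimal sketch of one common way to realize that idea: a residual temporal convolution (a hypothetical `TemporalConv` module, not the authors' released code) that is zero-initialized, so the network initially reproduces the stage-one, per-frame behaviour and only learns temporal structure during fine-tuning.

```python
import torch
import torch.nn as nn


class TemporalConv(nn.Module):
    """Residual 1D convolution over the frame axis.

    Hypothetical illustration of "adding layers to model the temporal
    dimension": zero-initialized weights plus a residual connection mean
    the module is an identity map before stage-two fine-tuning, so the
    stage-one (static pseudo-pose) behaviour is preserved at the start.
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        # Identity initialization: output is exactly the input until
        # the temporal weights are updated during fine-tuning.
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels) -> convolve along the frame axis.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return x + h


# Example: an 8-frame sequence of 512-dim per-frame features.
feats = torch.randn(2, 8, 512)
temporal = TemporalConv(512)
out = temporal(feats)
# Before fine-tuning the module is an identity, so shapes and values match.
assert out.shape == feats.shape and torch.allclose(out, feats)
```

In this sketch the spatial (per-frame) U-Net blocks learned in stage one would be kept frozen or fine-tuned jointly while these temporal modules are interleaved between them; the zero-init residual is a standard trick for inflating an image-trained network to sequences without degrading its pretrained outputs.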