Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

May 16, 2023
Authors: Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, Sonal Gupta
cs.AI

Abstract

Text-guided human motion generation has drawn significant interest because of its impactful applications spanning animation and robotics. Recently, application of diffusion models for motion generation has enabled improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts. In this paper, we introduce Make-An-Animation, a text-conditioned human motion generation model which learns more diverse poses and prompts from large-scale image-text datasets, enabling significant improvement in performance over prior works. Make-An-Animation is trained in two stages. First, we train on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. Second, we fine-tune on motion capture data, adding additional layers to model the temporal dimension. Unlike prior diffusion models for motion generation, Make-An-Animation uses a U-Net architecture similar to recent text-to-video generation models. Human evaluation of motion realism and alignment with input text shows that our model reaches state-of-the-art performance on text-to-motion generation.
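The abstract describes a two-stage training recipe: a text-conditioned diffusion denoiser is first trained on static (text, pseudo-pose) pairs mined from image-text data, then extended with additional layers that model the temporal dimension and fine-tuned on motion-capture sequences. The following is a minimal PyTorch sketch of that idea, not the authors' implementation: the module names, the pose dimensionality, and the use of a small MLP plus a zero-initialized 1D convolution in place of the paper's U-Net and temporal layers are all illustrative assumptions.

# Minimal sketch (illustrative, not the paper's code) of the two-stage idea:
# stage 1 trains a per-pose denoiser on (text, static pseudo-pose) pairs;
# stage 2 adds a temporal layer and fine-tunes on motion-capture sequences.
import torch
import torch.nn as nn


class PoseDenoiser(nn.Module):
    """Stage 1: denoises a single static pose given a text embedding."""

    def __init__(self, pose_dim=135, text_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + text_dim + 1, hidden),  # pose + timestep + text
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, noisy_pose, t, text_emb):
        # noisy_pose: (B, pose_dim), t: (B, 1), text_emb: (B, text_dim)
        return self.net(torch.cat([noisy_pose, t, text_emb], dim=-1))


class MotionDenoiser(nn.Module):
    """Stage 2: wraps the pretrained pose denoiser and adds a temporal layer
    (here a 1D convolution over frames) trained on motion-capture data."""

    def __init__(self, pose_model: PoseDenoiser, pose_dim=135):
        super().__init__()
        self.pose_model = pose_model             # initialized from stage 1
        self.temporal = nn.Conv1d(pose_dim, pose_dim, kernel_size=3, padding=1)
        nn.init.zeros_(self.temporal.weight)     # new layer starts as a no-op residual
        nn.init.zeros_(self.temporal.bias)

    def forward(self, noisy_motion, t, text_emb):
        # noisy_motion: (B, T, pose_dim); run the per-frame model, then mix
        # information across frames with the newly added temporal layer.
        B, T, D = noisy_motion.shape
        per_frame = self.pose_model(
            noisy_motion.reshape(B * T, D),
            t.repeat_interleave(T, dim=0),
            text_emb.repeat_interleave(T, dim=0),
        ).reshape(B, T, D)
        temporal = self.temporal(per_frame.transpose(1, 2)).transpose(1, 2)
        return per_frame + temporal


# Toy usage: a batch of 2 motions, 16 frames each.
model = MotionDenoiser(PoseDenoiser())
x = torch.randn(2, 16, 135)
t = torch.rand(2, 1)
text = torch.randn(2, 512)
print(model(x, t, text).shape)  # torch.Size([2, 16, 135])

Zero-initializing the added temporal layer keeps the stage-2 model's per-frame behavior identical to the stage-1 model at the start of fine-tuning, a common choice when extending image-trained diffusion models to sequence data.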