Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

Abstract

Text-guided human motion generation has drawn significant interest because of its impactful applications in animation and robotics. Recently, applying diffusion models to motion generation has improved the quality of generated motions. However, existing approaches rely on relatively small-scale motion capture data, which leads to poor performance on more diverse, in-the-wild prompts. In this paper, we introduce Make-An-Animation, a text-conditioned human motion generation model that learns more diverse poses and prompts from large-scale image-text datasets, yielding a significant improvement in performance over prior work. Make-An-Animation is trained in two stages. First, we train on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. Second, we fine-tune on motion capture data, adding layers that model the temporal dimension. Unlike prior diffusion models for motion generation, Make-An-Animation uses a U-Net architecture similar to recent text-to-video generation models. Human evaluation of motion realism and alignment with the input text shows that our model reaches state-of-the-art performance on text-to-motion generation.
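
The two-stage idea can be illustrated with a toy sketch. It is not the authors' implementation: it contains no diffusion process and no U-Net, and every name in it (spatial_layer, temporal_layer, forward, IDENTITY_KERNEL) is hypothetical. It only shows how per-frame layers could be trained on static poses (sequences of length one) and how a layer over the time axis could be added later for motion data. Initializing the added temporal layer as an identity, so that the pretrained per-frame behavior is preserved when fine-tuning starts, is an assumption borrowed from common practice in text-to-video models, not something the abstract states.

```python
from typing import List, Optional

Pose = List[float]    # one frame: a flat vector of pose parameters
Motion = List[Pose]   # a sequence of frames; a static pose has length 1

# Assumption: a temporal kernel that passes each frame through unchanged.
IDENTITY_KERNEL = [0.0, 1.0, 0.0]


def spatial_layer(frame: Pose, weights: List[List[float]]) -> Pose:
    """Per-frame linear map; the part that can be trained on static poses."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def temporal_layer(frames: Motion, kernel: List[float]) -> Motion:
    """1D convolution over the time axis, replicating the edge frames."""
    half = len(kernel) // 2
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        mixed = [0.0] * len(frames[t])
        for k, weight in enumerate(kernel):
            src = min(max(t + k - half, 0), last)  # clamp index at the edges
            for d, value in enumerate(frames[src]):
                mixed[d] += weight * value
        out.append(mixed)
    return out


def forward(frames: Motion, weights: List[List[float]],
            kernel: Optional[List[float]] = None) -> Motion:
    """Stage 1 (kernel is None): per-frame layers only.
    Stage 2: the same per-frame layers followed by the temporal layer."""
    hidden = [spatial_layer(f, weights) for f in frames]
    if kernel is None:
        return hidden
    return temporal_layer(hidden, kernel)


weights = [[2.0, 0.0], [0.0, 0.5]]
static_pose = [[1.0, 4.0]]             # stage 1 input: a single frame
motion = [[0.0, 0.0], [4.0, 8.0]]      # stage 2 input: two frames

stage1 = forward(static_pose, weights)
stage2_init = forward(motion, weights, IDENTITY_KERNEL)
stage2_smooth = forward(motion, weights, [0.25, 0.5, 0.25])
```

With the identity kernel, the stage-2 model returns the same per-frame outputs as the stage-1 model, so fine-tuning would start from the pretrained behavior; a non-identity kernel such as [0.25, 0.5, 0.25] mixes information across neighboring frames.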
