DrawMotion: Generating 3D Human Motion by Freehand Drawing

Abstract

Text-to-motion generation translates textual descriptions into human motions, but users often struggle to convey their intended motions precisely through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions from both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives. 1) Freehand drawing condition: to accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm that automatically generates hand-drawn stickman sketches across different dataset formats. 2) Multi-condition fusion: we propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared with conventional approaches. 3) Training-free guidance: the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, so classifier-guidance gradients can update these features, aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces the time users need to generate motions matching what they imagine by approximately 46.7%. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.
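
The training-free guidance idea can be illustrated with a toy example. The sketch below is not the DrawMotion implementation: it assumes a hypothetical linear map `W` standing in for whatever decodes an intermediate feature into a 2D keypoint, a hypothetical squared-distance energy between that keypoint and a point taken from the user's drawing, and a fixed step size. It only shows why continuous features matter: because the feature vector is continuous, the gradient of the energy can be used to nudge it directly, without retraining anything.

```python
# Toy illustration of classifier-guidance-style feature updates.
# All names here (W, project, energy, guide) are hypothetical and chosen
# for illustration; they are not taken from the DrawMotion code base.

def project(W, z):
    """Map feature vector z to 2D keypoint coordinates via a linear stand-in decoder."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def energy(W, z, target):
    """Squared distance between the projected keypoint and the sketch target."""
    return sum((p - t) ** 2 for p, t in zip(project(W, z), target))

def energy_grad(W, z, target):
    """Analytic gradient of the energy with respect to z: 2 * W^T (W z - target)."""
    residual = [p - t for p, t in zip(project(W, z), target)]
    return [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(z))]

def guide(W, z, target, step=0.1, n_steps=20):
    """Repeatedly move z against the energy gradient (plain gradient descent)."""
    z = list(z)
    for _ in range(n_steps):
        g = energy_grad(W, z, target)
        z = [x - step * gx for x, gx in zip(z, g)]
    return z

if __name__ == "__main__":
    W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # assumed stand-in decoder
    z0 = [0.0, 0.0, 0.0]                     # assumed intermediate feature
    target = [1.0, 2.0]                      # assumed keypoint from the drawing
    z1 = guide(W, z0, target)
    print("energy before:", energy(W, z0, target))
    print("energy after: ", energy(W, z1, target))
```

In a diffusion model, an update of this kind would typically be applied once per denoising step with a small guidance scale rather than run to convergence. The loop here is kept only to make the effect of the gradient visible.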