Seamless Human Motion Composition with Blended Positional Encodings

Abstract

Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. Prior work has focused on generating motion guided by text, music, or scenes, but the results are typically isolated motions confined to short durations. We instead address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. To this end, we introduce Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. Thanks to its Pose-Centric Cross-ATtention, FlowMDM excels even when trained with only a single description per motion sequence, and it is robust to varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics for detecting abrupt transitions: Peak Jerk and Area Under the Jerk.
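The abstract names the two jerk-based metrics but does not define them. The following is a minimal sketch of how such metrics could be computed, under stated assumptions: jerk is taken as the third time derivative of joint position and approximated with third-order finite differences, the peak metric is the maximum jerk magnitude over all joints and frames, and the area metric sums the per-frame deviation from a baseline value. The function names, the `baseline` parameter, and the normalization are hypothetical and may differ from the definitions used in the paper.

```python
import math


def jerk_magnitudes(frames, fps):
    """Per-frame, per-joint jerk magnitude from third-order finite differences.

    frames: list of T frames, each a list of J (x, y, z) joint positions.
    Returns a list of T-3 rows, each holding J magnitudes.
    """
    dt3 = (1.0 / fps) ** 3
    out = []
    for t in range(len(frames) - 3):
        row = []
        for j in range(len(frames[t])):
            # Third forward difference: p[t+3] - 3 p[t+2] + 3 p[t+1] - p[t]
            d = [
                (frames[t + 3][j][k] - 3 * frames[t + 2][j][k]
                 + 3 * frames[t + 1][j][k] - frames[t][j][k]) / dt3
                for k in range(3)
            ]
            row.append(math.sqrt(sum(c * c for c in d)))
        out.append(row)
    return out


def peak_jerk(frames, fps):
    """Assumed definition: largest jerk magnitude over all joints and frames."""
    return max(max(row) for row in jerk_magnitudes(frames, fps))


def area_under_jerk(frames, fps, baseline=0.0):
    """Assumed definition: summed deviation of the per-frame maximum jerk
    from a baseline (hypothetical parameter), scaled by the frame duration."""
    per_frame = [max(row) for row in jerk_magnitudes(frames, fps)]
    return sum(abs(v - baseline) for v in per_frame) / fps


# Toy single-joint example: constant-velocity motion vs. the same motion
# with a sudden 0.2-unit jump at frame 10 (an abrupt transition).
fps = 30
smooth = [[(0.01 * t, 0.0, 0.0)] for t in range(20)]
abrupt = [[(0.01 * t + (0.2 if t >= 10 else 0.0), 0.0, 0.0)] for t in range(20)]
```

In this toy setup the constant-velocity sequence has near-zero jerk, while the sequence with the jump produces a sharp spike. That spike is the kind of discontinuity such metrics are meant to expose at transitions between composed motions.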

