MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms

Abstract

This work addresses interactive editing of generated human motion. Previous motion diffusion models lack explicit modeling of word-level text-motion correspondence and offer limited explainability, which restricts their fine-grained editing ability. To address this issue, we propose MotionCLR, an attention-based motion diffusion model with CLeaR modeling of attention mechanisms. Technically, MotionCLR models in-modality and cross-modality interactions with self-attention and cross-attention, respectively. More specifically, the self-attention mechanism measures the sequential similarity between frames and affects the order of motion features. By contrast, the cross-attention mechanism finds the fine-grained word-sequence correspondence and activates the corresponding timesteps in the motion sequence. Based on these key properties, we develop a versatile set of simple yet effective motion editing methods that manipulate attention maps, such as motion (de-)emphasizing, in-place motion replacement, and example-based motion generation. To further verify the explainability of the attention mechanism, we additionally explore action counting and grounded motion generation via attention maps. Our experimental results show that our method achieves good generation and editing ability with good explainability.
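To make the cross-attention idea concrete, the following is a minimal sketch and not the MotionCLR implementation. It builds a frames-by-words attention map with scaled dot-product attention, then edits it by shifting the weight of one word. The function names, the toy 2-dimensional features, and the `reweight_word` rule (add an offset, clip at zero, renormalize) are all assumptions chosen for illustration; the abstract does not specify the exact editing rule.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_map(frame_queries, word_keys):
    """Return a (frames x words) map; each row is a distribution over words."""
    scale = math.sqrt(len(word_keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / scale
                 for key in word_keys])
        for query in frame_queries
    ]


def reweight_word(attn_map, word_index, delta):
    """Hypothetical editing rule (an assumption, not taken from the paper):
    shift one word's weight by delta, clip at zero, renormalize each row.
    Assumes at least two words, so each row keeps a positive sum."""
    edited = []
    for row in attn_map:
        new_row = list(row)
        new_row[word_index] = max(0.0, new_row[word_index] + delta)
        total = sum(new_row)
        edited.append([w / total for w in new_row])
    return edited


# Toy inputs: 3 motion frames and 2 words (say, "walk" and "jump").
frame_queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
word_keys = [[1.0, 0.0], [0.0, 1.0]]

attn = cross_attention_map(frame_queries, word_keys)
emphasized = reweight_word(attn, word_index=1, delta=0.3)     # emphasize word 1
deemphasized = reweight_word(attn, word_index=1, delta=-0.1)  # de-emphasize it
```

In this sketch each row of `attn` says how strongly a frame attends to each word, so raising or lowering one column changes how much that word influences every frame. Because the edit happens on the attention map at inference time, no retraining is involved, which is the sense in which such editing is training-free.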
