Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM

Abstract

Human motion generation is a significant pursuit in generative computer vision, yet achieving efficient generation of long motion sequences remains challenging. Recent advances in state space models (SSMs), notably Mamba, have shown considerable promise for long-sequence modeling through an efficient, hardware-aware design, which makes them a promising foundation for motion generation models. Nevertheless, adapting SSMs to motion generation is difficult because no specialized architecture exists for modeling motion sequences. To address these challenges, we propose Motion Mamba, a simple and efficient approach that presents a pioneering motion generation model built on SSMs. Specifically, we design a Hierarchical Temporal Mamba (HTM) block that processes temporal data by ensembling varying numbers of isolated SSM modules across a symmetric U-Net architecture, with the aim of preserving motion consistency between frames. We also design a Bidirectional Spatial Mamba (BSM) block that processes latent poses bidirectionally to improve the accuracy of motion generation within a temporal frame. On the HumanML3D and KIT-ML datasets, the proposed method achieves up to 50% FID improvement and runs up to 4 times faster than the previous best diffusion-based method, demonstrating strong capabilities for high-quality long-sequence motion modeling and real-time human motion generation. Project website: https://steve-zeyu-zhang.github.io/MotionMamba/
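As a rough illustration of the two ideas named in the abstract (ensembling several independent SSM scans, and scanning a sequence in both directions), the following toy sketch may help. It is not the authors' implementation: it uses a fixed scalar linear recurrence rather than Mamba's selective SSM, and the function names `ssm_scan`, `ensemble_scan`, `bidirectional_scan`, and `symmetric_scan_counts`, the decay parameters, and the mirrored scan-count schedule are all assumptions made purely for illustration.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.9) -> List[float]:
    """Toy scalar linear state-space recurrence: h_t = a * h_{t-1} + x_t, y_t = h_t.

    Assumption: a fixed decay `a` stands in for the learned, input-dependent
    (selective) parameters of a real Mamba block.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + x
        ys.append(h)
    return ys


def ensemble_scan(xs: List[float], decays: List[float]) -> List[float]:
    """Average the outputs of several independent scans (one per decay value)."""
    outs = [ssm_scan(xs, a) for a in decays]
    return [sum(col) / len(outs) for col in zip(*outs)]


def bidirectional_scan(xs: List[float], a: float = 0.9) -> List[float]:
    """Sum a forward scan and a backward scan so each position sees both sides."""
    fwd = ssm_scan(xs, a)
    bwd = ssm_scan(xs[::-1], a)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def symmetric_scan_counts(depth: int) -> List[int]:
    """Hypothetical schedule: scans per layer, mirrored around a U-Net bottleneck."""
    down = list(range(depth, 0, -1))  # e.g. depth=3 gives [3, 2, 1]
    return down + down[-2::-1]        # mirrored: [3, 2, 1, 2, 3]


if __name__ == "__main__":
    print(bidirectional_scan([1.0, 0.0, 0.0], a=0.5))
    print(ensemble_scan([1.0, 0.0], [0.5, 0.25]))
    print(symmetric_scan_counts(3))
```

In this sketch the forward-only scan lets a position depend only on earlier inputs, while the bidirectional variant also mixes in later ones. The mirrored schedule only shows what "varying numbers of modules across a symmetric architecture" could look like; which layers receive more scans in the actual model is not stated in the abstract.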
