Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM

Abstract

Human motion generation is an important problem in generative computer vision, yet generating long motion sequences efficiently remains challenging. Recent advances in state space models (SSMs), notably Mamba, have shown considerable promise for long-sequence modeling thanks to an efficient hardware-aware design, which makes them a promising foundation for motion generation models. Nevertheless, adapting SSMs to motion generation is difficult because there is no specialized architecture designed for modeling motion sequences. To address these challenges, we propose Motion Mamba, a simple and efficient approach that presents a pioneering motion generation model built on SSMs. Specifically, we design a Hierarchical Temporal Mamba (HTM) block that processes temporal data by ensembling varying numbers of isolated SSM modules across a symmetric U-Net architecture, with the aim of preserving motion consistency between frames. We also design a Bidirectional Spatial Mamba (BSM) block that processes latent poses bidirectionally to improve the accuracy of motion generation within a temporal frame. Compared with the previous best diffusion-based method, our method achieves up to 50% FID improvement and runs up to 4 times faster on the HumanML3D and KIT-ML datasets, demonstrating strong capabilities for high-quality long-sequence motion modeling and real-time human motion generation. See the project website: https://steve-zeyu-zhang.github.io/MotionMamba/
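
As a rough illustration of the two ideas named in the abstract, the sketch below runs a toy scalar SSM recurrence forward and backward over a sequence and sums the two passes, and builds a symmetric per-layer schedule of scan counts. Everything in it is an assumption made for illustration: the functions `ssm_scan`, `bidirectional_scan`, and `symmetric_scan_counts`, the fixed parameters `a`, `b`, `c`, and the odd-number schedule are hypothetical and are not taken from the Motion Mamba implementation, which relies on selective (input-dependent) Mamba layers rather than a fixed scalar recurrence.

```python
from typing import List


def ssm_scan(x: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Toy scalar linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t  # state update carries information from earlier steps
        y.append(c * h)      # readout of the current state
    return y


def bidirectional_scan(x: List[float], a: float = 0.9) -> List[float]:
    """Sum a forward and a backward scan so each position sees both directions."""
    fwd = ssm_scan(x, a)
    bwd = ssm_scan(x[::-1], a)[::-1]  # scan the reversed input, then restore order
    return [f + g for f, g in zip(fwd, bwd)]


def symmetric_scan_counts(depth: int) -> List[int]:
    """Hypothetical schedule: scans per layer shrink toward the middle, then mirror back."""
    down = [2 * k - 1 for k in range(depth, 0, -1)]  # depth=3 gives [5, 3, 1]
    return down + down[-2::-1]                        # mirrored: [5, 3, 1, 3, 5]


if __name__ == "__main__":
    # An impulse in the middle spreads to both sides under the bidirectional scan.
    print(bidirectional_scan([0, 0, 1, 0, 0], a=0.5))
    print(symmetric_scan_counts(3))
```

With a unidirectional scan, the impulse would only influence later positions; summing the reversed pass is one simple way to let earlier positions see it as well. The mirrored schedule only illustrates what "varying numbers of SSM modules across a symmetric U-Net" could look like, not the configuration the authors used.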
