Seamless Human Motion Composition with Blended Positional Encodings

Abstract

Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. Prior work has focused on generating motion guided by text, music, or scenes, but it typically produces isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. To this end, we introduce Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention, which makes it robust against varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics, Peak Jerk and Area Under the Jerk, to detect abrupt transitions.
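
The abstract names the two jerk-based metrics but does not define them, so the following is only a minimal sketch under simplifying assumptions. Jerk is taken to be the third time derivative of a joint's position, estimated with finite differences; the peak is the maximum jerk magnitude, and the area is a rectangle-rule integral of that magnitude over time. The function names (`jerk_magnitudes`, `peak_jerk`, `area_under_jerk`), the single-joint input, and the absence of any normalization are assumptions made for illustration and may differ from the definitions used for FlowMDM.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def jerk_magnitudes(positions: Sequence[Vec3], fps: float) -> List[float]:
    """Jerk magnitude of one joint, estimated by third-order forward differences.

    Assumption: positions are sampled uniformly at `fps` frames per second.
    """
    dt = 1.0 / fps
    magnitudes = []
    for t in range(len(positions) - 3):
        p0, p1, p2, p3 = positions[t:t + 4]
        # Third forward difference per axis: (p3 - 3*p2 + 3*p1 - p0) / dt^3
        jerk = [(d - 3 * c + 3 * b - a) / dt ** 3
                for a, b, c, d in zip(p0, p1, p2, p3)]
        magnitudes.append(math.sqrt(sum(x * x for x in jerk)))
    return magnitudes


def peak_jerk(jerk: Sequence[float]) -> float:
    """Largest jerk magnitude in the window (hypothetical simplification)."""
    return max(jerk)


def area_under_jerk(jerk: Sequence[float], fps: float) -> float:
    """Rectangle-rule integral of jerk magnitude over time (hypothetical simplification)."""
    return sum(jerk) / fps


# A constant-acceleration path x = t^2 has zero jerk everywhere.
smooth = [(float(t * t), 0.0, 0.0) for t in range(6)]
# The same path with a sudden 10-unit jump starting at frame 3.
abrupt = [(x + (10.0 if t >= 3 else 0.0), y, z)
          for t, (x, y, z) in enumerate(smooth)]

smooth_jerk = jerk_magnitudes(smooth, fps=1.0)
abrupt_jerk = jerk_magnitudes(abrupt, fps=1.0)
```

In this toy case the smooth path yields zero jerk in every frame, while the discontinuity produces a clear spike, which is the kind of abrupt transition such metrics are meant to expose.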
