

Seamless Human Motion Composition with Blended Positional Encodings

February 23, 2024
作者: German Barquero, Sergio Escalera, Cristina Palmero
cs.AI

Abstract

Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. While prior works have focused on generating motion guided by text, music, or scenes, these typically result in isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this, we introduce the Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention, which makes it robust against varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt transitions.