

Seamless Human Motion Composition with Blended Positional Encodings

February 23, 2024
作者: German Barquero, Sergio Escalera, Cristina Palmero
cs.AI

Abstract

Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. While prior works have focused on generating motion guided by text, music, or scenes, these typically result in isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this, we introduce the Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention, which makes it robust against varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt transitions.
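The two proposed metrics, Peak Jerk and Area Under the Jerk, can be illustrated with a short numerical sketch. Jerk is the third time-derivative of joint positions; below it is approximated with finite differences on a `(T, J, 3)` pose array. This is only a minimal illustration of the idea, the function name, frame rate, joint averaging, and area approximation are assumptions, and the paper's exact normalization and transition windowing may differ.

```python
import numpy as np

def jerk_metrics(positions, fps=20.0):
    """Sketch of Peak Jerk (PJ) and Area Under the Jerk (AUJ).

    Jerk = third time-derivative of joint positions, approximated here
    with finite differences. Hypothetical helper; the paper's exact
    normalization and windowing around transitions may differ.

    positions: float array of shape (T, J, 3) -- T frames, J joints, 3D coords.
    """
    dt = 1.0 / fps
    # Third-order finite difference along time -> shape (T-3, J, 3).
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    # Per-frame jerk magnitude, averaged over joints -> shape (T-3,).
    mag = np.linalg.norm(jerk, axis=-1).mean(axis=-1)
    peak_jerk = float(mag.max())            # PJ: largest jerk spike
    area_under_jerk = float(mag.sum() * dt)  # AUJ: Riemann-sum approximation
    return peak_jerk, area_under_jerk
```

A smooth, constant-velocity motion yields (near-)zero jerk on both metrics, while a sequence with a sudden positional jump, i.e. an abrupt transition between composed motions, produces a large spike, which is exactly what these metrics are designed to detect.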
December 15, 2024