Diffusion-Based Discrete Motion Tokenizer: Bridging Semantic and Kinematic Conditions

Abstract

Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.
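The data flow of the three stages can be illustrated with a toy sketch. This is not the MoTok implementation: the function names (`perceive`, `plan`, `control`), the 2-D pose features, the four-entry codebook, and the nearest-code planning rule are all hypothetical, and a simple iterative refinement loop stands in for the diffusion decoder. It only shows where a coarse constraint (a rough path) and a fine-grained constraint (an exact pose at one frame) would enter the pipeline.

```python
import random
from typing import Dict, List, Optional

Vec = List[float]  # toy 2-D pose feature per frame (an assumption)


def perceive(text: str, coarse_path: Optional[List[Vec]] = None) -> Dict:
    """Perception: pack raw conditions into features (toy stand-in)."""
    return {"text_seed": sum(ord(c) for c in text), "coarse_path": coarse_path}


def plan(features: Dict, codebook: List[Vec], num_tokens: int,
         frames_per_token: int) -> List[int]:
    """Planning: emit a single layer of discrete token ids."""
    rng = random.Random(features["text_seed"])
    path = features["coarse_path"]
    tokens = []
    for k in range(num_tokens):
        if path is None:
            # No kinematic hint: draw a token from the text-seeded sampler.
            tokens.append(rng.randrange(len(codebook)))
            continue
        # Coarse constraint: pick the code closest to the window's mean pose.
        window = path[k * frames_per_token:(k + 1) * frames_per_token]
        mean = [sum(p[d] for p in window) / len(window) for d in range(2)]
        tokens.append(min(
            range(len(codebook)),
            key=lambda j: sum((codebook[j][d] - mean[d]) ** 2 for d in range(2)),
        ))
    return tokens


def control(tokens: List[int], codebook: List[Vec], frames_per_token: int,
            constraints: Dict[int, Vec], steps: int = 50,
            seed: int = 0) -> List[Vec]:
    """Control: refine noise toward the token-decoded target.

    Toy stand-in for a diffusion decoder; fine-grained constraints are
    re-imposed after every refinement step.
    """
    rng = random.Random(seed)
    # Each token's code vector is held for frames_per_token frames.
    target = [list(codebook[t]) for t in tokens for _ in range(frames_per_token)]
    motion = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in target]
    for step in range(steps):
        alpha = 1.0 / (steps - step)  # reaches 1.0 on the final step
        for i, frame in enumerate(motion):
            for d in range(len(frame)):
                frame[d] += alpha * (target[i][d] - frame[d])
        for i, pose in constraints.items():
            motion[i] = list(pose)  # hard-enforce the exact pose
    return motion


# Hypothetical usage: 8 frames, 2 tokens, one exact pose at the last frame.
codebook = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
coarse_path = [[0.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
features = perceive("a person walks diagonally", coarse_path)
tokens = plan(features, codebook, num_tokens=2, frames_per_token=4)
motion = control(tokens, codebook, frames_per_token=4,
                 constraints={7: [0.9, 1.1]})
print(tokens)  # [0, 2]
```

In this sketch the planner only sees the rough path, while the exact pose at frame 7 is applied during refinement, mirroring the separation of coarse and fine-grained constraints described above.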