Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Abstract

Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.
