Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Abstract

Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder. This enables compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods, whose quality degrades under stronger kinematic constraints, our method improves fidelity, reducing FID from 0.033 to 0.014.
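To put the reported numbers in relative terms, the short sketch below computes the percentage reduction for each before/after pair quoted in the abstract. The function name `relative_reduction` and the dictionary labels are illustrative choices and do not come from the paper. Only the six numeric values are taken from the text above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# (before, after) pairs copied from the abstract; the labels are our own wording.
reported = {
    "trajectory error (cm), MaskControl vs. proposed": (0.72, 0.08),
    "FID, MaskControl vs. proposed": (0.083, 0.029),
    "FID, proposed method under stronger kinematic constraints": (0.033, 0.014),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% lower")
```

Under this reading, the trajectory error drops by about 88.9%, the FID against MaskControl by about 65.1%, and the FID under stronger kinematic constraints by about 57.6%.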
