Causal Motion Diffusion Models for Autoregressive Motion Generation

Abstract

Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches rely either on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or on autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, in which each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
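The idea of a frame-wise sampling schedule, in which later frames start denoising while earlier frames are only partially denoised, can be pictured with a small sketch. The code below is an illustration under stated assumptions, not the schedule used in CMDM: the function name `framewise_noise_schedule`, the integer noise levels (0 = clean, `num_levels` = pure noise), and the fixed per-frame `lag` are all hypothetical choices made for this example.

```python
def framewise_noise_schedule(num_frames, num_levels, lag):
    """Build a staggered per-frame noise schedule (illustrative assumption).

    Row s of the result holds the noise level of every frame at global
    sampling step s. Frame i begins denoising `lag` steps after frame i-1,
    so earlier frames are never noisier than later ones.
    """
    # The last frame starts after (num_frames - 1) * lag steps and then
    # needs num_levels further steps to reach level 0.
    total_steps = num_levels + (num_frames - 1) * lag
    schedule = []
    for step in range(total_steps + 1):
        row = [
            min(num_levels, max(0, num_levels + frame * lag - step))
            for frame in range(num_frames)
        ]
        schedule.append(row)
    return schedule


schedule = framewise_noise_schedule(num_frames=4, num_levels=3, lag=1)
for row in schedule:
    print(row)
```

With these toy settings the rows run from `[3, 3, 3, 3]` (all frames pure noise) through `[0, 1, 2, 3]` (first frame clean, later frames progressively noisier) to `[0, 0, 0, 0]`. In a real sampler, each row would tell a denoiser which noise level to assume for each frame at that step.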
