Causal Motion Diffusion Models for Autoregressive Motion Generation

Abstract

Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches rely either on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or on autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained with causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, in which each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
