ReMoMask: Retrieval-Augmented Masked Motion Generation

Abstract

Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face two sets of challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: (1) a Bidirectional Momentum Text-Motion Model that decouples the number of negative samples from the batch size via momentum queues, substantially improving cross-modal retrieval precision; (2) a Semantic Spatio-temporal Attention mechanism that enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; and (3) RAG-Classifier-Free Guidance, which incorporates a small amount of unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, which improves FID scores by 3.88% on HumanML3D and 10.97% on KIT-ML compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.
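To illustrate how a momentum queue can decouple the number of negatives from the batch size, the following is a minimal sketch in plain Python. It is not the ReMoMask implementation: it assumes a generic MoCo-style scheme (a slowly updated momentum encoder, a fixed-capacity FIFO queue of its outputs, and an InfoNCE loss), and every name and value in it (`ema_update`, `info_nce`, `enqueue`, `QUEUE_SIZE`, the momentum and temperature defaults) is hypothetical. Only the text-to-motion direction is shown; a bidirectional variant would presumably keep a second queue of text features for the motion-to-text direction.

```python
import math
from collections import deque


def ema_update(momentum_params, online_params, m=0.99):
    """Move the momentum encoder's parameters slowly toward the online encoder's."""
    return [m * pm + (1.0 - m) * po for pm, po in zip(momentum_params, online_params)]


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: negative log-softmax of the positive logit."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


def enqueue(queue, features):
    """Append new momentum features; a full deque silently drops the oldest ones."""
    queue.extend(features)


# Hypothetical capacity: the queue size, not the batch size, sets how many
# negatives each query is contrasted against.
QUEUE_SIZE = 4
motion_queue = deque(maxlen=QUEUE_SIZE)

# Two small "batches" of momentum-encoded motion features (toy 2-D vectors).
enqueue(motion_queue, [[0.0, 1.0], [-1.0, 0.0]])
enqueue(motion_queue, [[0.0, -1.0], [-0.6, 0.8], [-0.8, -0.6]])

text_query = [1.0, 0.0]       # toy text feature
matching_motion = [1.0, 0.0]  # toy feature of the paired motion
loss = info_nce(text_query, matching_motion, list(motion_queue))
```

In this sketch the loss is computed against four queued negatives even though each batch contributed only two or three features, which is the sense in which the queue separates the negative count from the batch size.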
