MotionVLA: Vision-Language-Action Model for Humanoid Motion

Abstract

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge is adapting a standard autoregressive model so that it captures high-frequency physical signals in motion sequences. We therefore propose DSFT, a dual-stream frequency tokenizer that separates motion into a Base stream and a physical (Phys) stream and compresses each independently with DCT truncation and byte-pair encoding (BPE). We further present MotionVLA, a Qwen3.5-based model that arranges Base and Phys tokens in a single unified sequence, with Phys tokens predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
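
The energy-compaction argument can be illustrated with a small, self-contained sketch. It is not the authors' analysis code: the trajectory below is a synthetic one-dimensional signal (a slow sway plus a small fast jitter) chosen only for illustration, velocity is approximated by a forward difference, and the helper names `dct2` and `low_freq_energy_fraction` are hypothetical. The sketch computes the share of signal energy held by the first five coefficients of an orthonormal DCT-II for position and for velocity. Differencing amplifies high frequencies, so the velocity share comes out noticeably lower, which is the qualitative effect the abstract describes; the exact percentages will not match the paper's figures.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        # Orthonormal scaling, so energy is preserved (Parseval).
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        acc = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                  for i, v in enumerate(x))
        out.append(scale * acc)
    return out


def low_freq_energy_fraction(x, keep=5):
    """Fraction of total energy in the first `keep` DCT coefficients."""
    coeffs = dct2(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total


# Assumption: a synthetic joint coordinate over 64 frames, made of a
# slow sway (1 cycle) and a small fast jitter (12 cycles).
N = 64
position = [0.5 * math.sin(2 * math.pi * t / N)
            + 0.05 * math.sin(2 * math.pi * 12 * t / N)
            for t in range(N)]

# Assumption: velocity approximated by a forward finite difference.
velocity = [position[t + 1] - position[t] for t in range(N - 1)]

pos_frac = low_freq_energy_fraction(position, keep=5)
vel_frac = low_freq_energy_fraction(velocity, keep=5)

print(f"position energy in first 5 DCT coefficients: {pos_frac:.3f}")
print(f"velocity energy in first 5 DCT coefficients: {vel_frac:.3f}")
```

On real motion data the same measurement would be taken per joint and per channel and then aggregated; how the paper aggregates is not stated in the abstract.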