MotionVLA: Vision-Language-Action Model for Humanoid Motion
June 13, 2026
Authors: Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang
cs.AI
Abstract
Generating realistic humanoid motion from scene images and text requires modeling both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, which forces heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five discrete cosine transform (DCT) coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge is adapting a standard autoregressive model so that it effectively models the high-frequency physical signals in motion sequences. To address these challenges, we propose DSFT, a dual-stream frequency tokenizer that separates motion into a Base stream and a physical (Phys) stream and compresses them independently with DCT truncation and byte-pair encoding (BPE). We further present MotionVLA, a Qwen3.5-based model that arranges Base and Phys tokens in a single unified sequence, in which Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
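The frequency-domain observation in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' analysis code. It assumes that the DCT is taken along the time axis of a single joint channel, that "five DCT coefficients" means the five lowest-frequency ones, and that velocity is a simple finite difference of position. The toy trajectory and the helper names `dct2` and `low_freq_energy_fraction` are hypothetical, and the resulting fractions are not expected to match the 93% and 37% reported for real motion data.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def low_freq_energy_fraction(x, k=5):
    """Fraction of the signal energy held by the first k DCT coefficients."""
    coeffs = dct2(x)
    total = sum(c * c for c in coeffs)
    if total == 0.0:
        return 0.0
    return sum(c * c for c in coeffs[:k]) / total

# Toy single-joint trajectory (an assumption, not real motion data):
# a slow pose change plus a small, fast jitter.
n = 64
position = [
    math.sin(2 * math.pi * t / n) + 0.05 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]
# Finite-difference velocity; differencing amplifies the fast component.
velocity = [position[t + 1] - position[t] for t in range(n - 1)]

pos_frac = low_freq_energy_fraction(position, k=5)
vel_frac = low_freq_energy_fraction(velocity, k=5)
print(f"position energy in first 5 DCT coefficients: {pos_frac:.3f}")
print(f"velocity energy in first 5 DCT coefficients: {vel_frac:.3f}")
```

In this toy setting the low-frequency coefficients retain a larger share of the position energy than of the velocity energy, because differentiation scales each component by its frequency. This is the kind of mismatch that motivates compressing the two streams separately.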