MotionVLA: Vision-Language-Action Model for Humanoid Motion

June 13, 2026
Authors: Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang
cs.AI

Abstract

Generating realistic humanoid motion from scene images and text requires modeling both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge is adapting a standard autoregressive model so that it effectively models the high-frequency physical signals in motion sequences. To address both issues, we propose DSFT, a dual-stream frequency tokenizer that separates motion into a Base stream and a physical (Phys) stream and compresses each independently with DCT truncation and BPE. We further present MotionVLA, a Qwen3.5-based model that arranges Base and Phys tokens in a unified sequence, with Phys tokens predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
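
The position-versus-velocity energy mismatch described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' analysis code: the trajectory is a synthetic single-joint signal, the helper names (`dct_ortho`, `low_freq_energy_ratio`, `finite_difference`) are hypothetical, and it assumes that "five DCT coefficients" means the five lowest-frequency coefficients of an orthonormal DCT-II. It will not reproduce the 93% and 37% figures, which come from real motion data; it only shows why differentiating a trajectory shifts energy toward higher frequencies.

```python
import math

def dct_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence (direct O(N^2) form)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def low_freq_energy_ratio(x, k=5):
    """Fraction of signal energy held by the first k DCT coefficients."""
    coeffs = dct_ortho(x)
    total = sum(c * c for c in coeffs)
    if total == 0.0:
        return 1.0  # an all-zero signal is trivially captured
    return sum(c * c for c in coeffs[:k]) / total

def finite_difference(x):
    """Frame-to-frame difference, used here as a stand-in for joint velocity."""
    return [b - a for a, b in zip(x, x[1:])]

# Synthetic trajectory (an assumption, not real motion data):
# a slow sway plus a small, fast jitter.
T = 64
position = [
    math.sin(2 * math.pi * t / T) + 0.05 * math.sin(2 * math.pi * 12 * t / T)
    for t in range(T)
]
velocity = finite_difference(position)

pos_ratio = low_freq_energy_ratio(position, k=5)
vel_ratio = low_freq_energy_ratio(velocity, k=5)
print(f"position energy in first 5 DCT coefficients: {pos_ratio:.3f}")
print(f"velocity energy in first 5 DCT coefficients: {vel_ratio:.3f}")
```

Differencing scales each frequency component by a factor that grows with frequency, so the small jitter carries a much larger share of the velocity energy than of the position energy. A codebook fitted to both signals jointly would therefore tend to be dominated by the low-frequency pose statistics, which is the motivation the abstract gives for tokenizing the two streams separately.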