InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
October 15, 2025
Authors: Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, Bolun Wang, Fangjing Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jinhui Ye, Junqiu Yu, Jia Zeng, Jingjing Zhang, Jinyu Zhang, Shi Zhang, Feng Zheng, Bowen Zhou, Yangkun Zhu
cs.AI
Abstract
We introduce InternVLA-M1, a unified framework for spatial grounding and
robot control that advances instruction-following robots toward scalable,
general-purpose intelligence. Its core idea is spatially guided
vision-language-action training, where spatial grounding serves as the critical
link between instructions and robot actions. InternVLA-M1 employs a two-stage
pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning
samples to determine "where to act" by aligning instructions with visual,
embodiment-agnostic positions, and (ii) spatially guided action post-training
to decide "how to act" by generating embodiment-aware actions through
plug-and-play spatial prompting. This spatially guided training recipe yields
consistent gains: InternVLA-M1 outperforms its variant without spatial guidance
by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO
Franka, while demonstrating stronger spatial reasoning capability in box,
point, and trace prediction. To further scale instruction following, we built a
simulation engine to collect 244K generalizable pick-and-place episodes,
enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In
real-world cluttered pick-and-place, InternVLA-M1 improved by 7.3%, and with
synthetic co-training, achieved +20.6% on unseen objects and novel
configurations. Moreover, in long-horizon reasoning-intensive scenarios, it
surpassed existing works by over 10%. These results highlight spatially guided
training as a unifying principle for scalable and resilient generalist robots.
Code and models are available at
https://github.com/InternRobotics/InternVLA-M1.
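
To make the two-stage idea in the abstract concrete, below is a minimal conceptual sketch, not the authors' implementation: a grounding module first predicts "where to act" (an embodiment-agnostic 2D point from image and instruction features), and an action head then decides "how to act", conditioned on that spatial prompt. All module names, feature dimensions, and the plug-and-play prompting interface shown here are illustrative assumptions; see the official repository for the actual model.

```python
# Hypothetical sketch of spatially guided vision-language-action training.
# Module names, shapes, and interfaces are assumptions for illustration only.
import torch
import torch.nn as nn


class SpatialGrounder(nn.Module):
    """Stage (i): map image + instruction features to a 2D target point ("where to act")."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 2),  # normalized (x, y) position in the image
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(torch.cat([img_feat, txt_feat], dim=-1)))


class SpatiallyGuidedPolicy(nn.Module):
    """Stage (ii): an action head conditioned on the spatial prompt ("how to act")."""

    def __init__(self, feat_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.grounder = SpatialGrounder(feat_dim)
        self.action_head = nn.Sequential(
            nn.Linear(2 * feat_dim + 2, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, action_dim),  # embodiment-aware action vector
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor):
        where = self.grounder(img_feat, txt_feat)            # predicted 2D point
        x = torch.cat([img_feat, txt_feat, where], dim=-1)   # spatial prompt injected as extra input
        return self.action_head(x), where


if __name__ == "__main__":
    img_feat = torch.randn(1, 256)  # stand-in for a visual encoder output
    txt_feat = torch.randn(1, 256)  # stand-in for an instruction encoder output
    action, where = SpatiallyGuidedPolicy()(img_feat, txt_feat)
    print(action.shape, where)      # torch.Size([1, 7]) and the predicted point
```

In this toy setup, the grounding head could be pre-trained on point/box supervision alone and the action head added afterwards, mirroring the decoupled "where" then "how" training recipe the abstract describes.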