
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

October 15, 2025
作者: Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, Bolun Wang, Fangjing Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jinhui Ye, Junqiu Yu, Jia Zeng, Jingjing Zhang, Jinyu Zhang, Shi Zhang, Feng Zheng, Bowen Zhou, Yangkun Zhu
cs.AI

Abstract

We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning samples to determine "where to act" by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide "how to act" by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world cluttered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon, reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.
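
The abstract's two-stage recipe (spatial grounding pre-training, then spatially guided action post-training) can be illustrated with a minimal sketch. The module names (SpatialGrounder, ActionExpert), feature dimensions, losses, and training loop below are hypothetical placeholders chosen for illustration, not the paper's actual architecture; see the repository linked above for the real implementation.

```python
# Minimal sketch of a two-stage "spatially guided VLA" recipe.
# All names, dimensions, and losses are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialGrounder(nn.Module):
    """Stage (i): predict an embodiment-agnostic 2D target (e.g. a point)
    from image + instruction features -- "where to act"."""
    def __init__(self, dim=256):
        super().__init__()
        self.img_enc = nn.Linear(512, dim)   # stand-in for a vision backbone
        self.txt_enc = nn.Linear(512, dim)   # stand-in for a language encoder
        self.head = nn.Linear(2 * dim, 2)    # (x, y) in normalized image coords

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_enc(img_feat), self.txt_enc(txt_feat)], dim=-1)
        return self.head(fused)

class ActionExpert(nn.Module):
    """Stage (ii): map the spatial prompt plus observation features to an
    embodiment-aware action chunk -- "how to act"."""
    def __init__(self, dim=256, action_dim=7, horizon=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 512, dim), nn.ReLU(),
            nn.Linear(dim, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, spatial_prompt, obs_feat):
        out = self.net(torch.cat([spatial_prompt, obs_feat], dim=-1))
        return out.view(-1, self.horizon, self.action_dim)

# --- Stage (i): spatial grounding pre-training on (image, instruction, target) ---
grounder = SpatialGrounder()
opt_g = torch.optim.AdamW(grounder.parameters(), lr=1e-4)
img, txt, target_xy = torch.randn(4, 512), torch.randn(4, 512), torch.rand(4, 2)
loss_g = nn.functional.l1_loss(grounder(img, txt), target_xy)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# --- Stage (ii): action post-training, conditioning the policy on the frozen
# grounder's prediction as a plug-and-play spatial prompt ---
expert = ActionExpert()
opt_a = torch.optim.AdamW(expert.parameters(), lr=1e-4)
obs, demo_actions = torch.randn(4, 512), torch.randn(4, 8, 7)
with torch.no_grad():
    prompt = grounder(img, txt)              # "where to act" guides the policy
loss_a = nn.functional.mse_loss(expert(prompt, obs), demo_actions)
opt_a.zero_grad(); loss_a.backward(); opt_a.step()
```

The key design point the sketch tries to convey is the decoupling: stage (i) learns positions that are independent of any particular robot, and stage (ii) consumes those positions as a prompt when learning embodiment-specific actions.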