A Pragmatic VLA Foundation Model

January 26, 2026
Authors: Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, Kecheng Zheng
cs.AI

Abstract

A capable Vision-Language-Action (VLA) foundation model holds great potential for robotic manipulation: it is expected to generalize faithfully across tasks and platforms while remaining cost-efficient (e.g., in the data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data collected from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing strong performance and broad generalizability. We have also built an efficient codebase that delivers a throughput of 261 samples per second per GPU in an 8-GPU training setup, a 1.5~2.8× speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. These features make our model well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.