A Pragmatic VLA Foundation Model

January 26, 2026
Authors: Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, Kecheng Zheng
cs.AI

Abstract

Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., the data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second per GPU in an 8-GPU training setup, a 1.5-2.8x speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. These features make our model well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
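To put the reported efficiency figures in perspective, the short Python sketch below derives the aggregate throughput and the implied per-GPU throughput of the baseline codebases from the numbers quoted in the abstract. The linear-scaling assumption and the variable names are ours, not taken from the released code; treat this as back-of-the-envelope arithmetic, not a benchmark script.

# Back-of-the-envelope check of the reported training throughput.
# Assumptions (not from the paper's codebase): throughput scales linearly
# across the 8 GPUs, and the 1.5x-2.8x speedup is measured on per-GPU
# throughput against existing VLA-oriented codebases.

PER_GPU_SAMPLES_PER_SEC = 261          # reported per-GPU throughput
NUM_GPUS = 8                           # reported training setup
SPEEDUP_LOW, SPEEDUP_HIGH = 1.5, 2.8   # reported speedup range

# Aggregate throughput under the linear-scaling assumption.
aggregate = PER_GPU_SAMPLES_PER_SEC * NUM_GPUS           # 2088 samples/s

# Per-GPU throughput implied for the baseline codebases.
baseline_low = PER_GPU_SAMPLES_PER_SEC / SPEEDUP_HIGH    # ~93 samples/s
baseline_high = PER_GPU_SAMPLES_PER_SEC / SPEEDUP_LOW    # 174 samples/s

print(f"Aggregate throughput: {aggregate} samples/s over {NUM_GPUS} GPUs")
print(f"Implied baseline per-GPU throughput: {baseline_low:.0f}-{baseline_high:.0f} samples/s")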