VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation

October 16, 2025
Authors: Han Zhao, Jiaxuan Zhang, Wenxuan Song, Pengxiang Ding, Donglin Wang
cs.AI

Abstract

Current vision-language-action (VLA) models, pre-trained on large-scale robotic data, exhibit strong multi-task capabilities and generalize well to variations in visual and language instructions for manipulation tasks. However, their success rate drops significantly when faced with object concepts outside the training data, such as unseen object descriptions and textures. To address this, we propose VLA^2, a novel agentic framework that uses OpenVLA as the execution backbone and leverages external modules such as web retrieval and object detection to provide the VLA with visual and textual knowledge about target objects. This approach mitigates generalization failures when handling out-of-distribution objects. Building on the LIBERO simulation environment, we introduce novel objects and object descriptions to construct a new evaluation benchmark with three difficulty levels to test the effectiveness of our method. Our framework outperforms the current state-of-the-art models on the designed hard-level generalization benchmark. Compared with the standalone OpenVLA baseline, VLA^2 achieves a 44.2% improvement in success rate on the hard-level benchmark and an average improvement of 20.2% across all customized environments, without any performance degradation on in-domain tasks. Project website: https://vla-2.github.io.
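
To make the pipeline described in the abstract more concrete, below is a minimal, hypothetical Python sketch of an agentic wrapper around a VLA execution backbone: an unseen object concept is resolved through web retrieval, grounded in the current observation with an object detector, and the enriched instruction is then passed to the policy. All class and function names here (ObjectKnowledge, WebRetriever, ObjectDetector, VLAPolicy, step) are illustrative assumptions, not the paper's actual interfaces.

```python
"""Illustrative sketch of an agentic VLA wrapper (hypothetical API, Python 3.10+).

Idea, as summarized in the abstract: when an instruction refers to an object
concept outside the training distribution, external modules (web retrieval,
object detection) supply visual/textual knowledge about the target object
before the VLA policy acts. All names below are placeholders.
"""

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ObjectKnowledge:
    """Visual and textual knowledge gathered about a target object."""
    description: str                                   # textual description in familiar terms
    reference_images: list[Any] = field(default_factory=list)
    bounding_box: tuple[int, int, int, int] | None = None


class WebRetriever:
    """Placeholder: fetch reference images/descriptions for an unseen concept."""
    def lookup(self, concept: str) -> ObjectKnowledge:
        # A real system would query a web search API; here we stub the result.
        return ObjectKnowledge(description=f"a common household object: {concept}")


class ObjectDetector:
    """Placeholder: ground the retrieved concept in the current camera image."""
    def locate(self, image: Any, knowledge: ObjectKnowledge) -> ObjectKnowledge:
        # A real detector would match the reference knowledge against the scene.
        knowledge.bounding_box = (0, 0, 64, 64)  # dummy box for illustration
        return knowledge


class VLAPolicy:
    """Placeholder for the execution backbone (e.g., an OpenVLA checkpoint)."""
    def act(self, image: Any, instruction: str) -> list[float]:
        return [0.0] * 7  # dummy 7-DoF action


def grounded_instruction(instruction: str, concept: str, know: ObjectKnowledge) -> str:
    """Rewrite the instruction so the unseen concept is described in seen terms."""
    return instruction.replace(concept, know.description)


def step(policy: VLAPolicy, retriever: WebRetriever, detector: ObjectDetector,
         image: Any, instruction: str, unseen_concept: str) -> list[float]:
    """One control step of the agentic pipeline sketched in the abstract."""
    knowledge = retriever.lookup(unseen_concept)      # textual/visual knowledge
    knowledge = detector.locate(image, knowledge)     # ground it in the scene
    enriched = grounded_instruction(instruction, unseen_concept, knowledge)
    return policy.act(image, enriched)                # backbone executes unchanged


if __name__ == "__main__":
    action = step(VLAPolicy(), WebRetriever(), ObjectDetector(),
                  image=None,
                  instruction="pick up the cerulean mug",
                  unseen_concept="cerulean mug")
    print(action)
```

The sketch reflects the division of labor the abstract suggests: the VLA backbone itself is left untouched (which is consistent with the reported lack of in-domain degradation), while the external modules enrich its inputs when an out-of-distribution concept appears.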