Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory
July 22, 2025
Authors: Guowei Lan, Kaixian Qu, René Zurbrügg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter
cs.AI
Abstract
Vision-language models (VLMs) have been widely adopted in robotics to enable
autonomous planning. However, grounding VLMs, originally trained on internet
data, to diverse real-world robots remains a challenge. This paper presents
ExpTeach, a framework that grounds VLMs to physical robots by building a
self-generated memory of real-world experiences. In ExpTeach, the VLM
autonomously plans actions, verifies outcomes, reflects on failures, and adapts
robot behaviors in a closed loop. The self-generated experiences during this
process are then summarized into a long-term memory, enabling retrieval of
learned knowledge to guide future tasks via retrieval-augmented generation
(RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with
an on-demand image annotation module. In experiments, we show that reflection
improves success rates from 36% to 84% on four challenging robotic tasks and
observe the emergence of intelligent object interactions, including creative
tool use. Across extensive tests on 12 real-world scenarios (including eight
unseen ones), we find that grounding with long-term memory boosts single-trial
success rates from 22% to 80%, demonstrating the effectiveness and
generalizability of ExpTeach.
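
To make the closed loop described in the abstract more concrete, the minimal Python sketch below mimics the plan → execute → verify → reflect cycle with a retrieval-augmented long-term memory. Every name here (Experience, LongTermMemory, closed_loop_episode, and the vlm/robot interfaces) is an illustrative assumption rather than the paper's implementation, and the keyword-overlap retrieval merely stands in for the RAG component used in ExpTeach.

```python
# Illustrative sketch only: interfaces and names are hypothetical, not ExpTeach's code.
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One summarized real-world episode stored in long-term memory."""
    task: str
    plan: str
    outcome: str
    reflection: str = ""


@dataclass
class LongTermMemory:
    """Stores summarized experiences; naive keyword overlap stands in for RAG retrieval."""
    entries: list[Experience] = field(default_factory=list)

    def retrieve(self, task: str, k: int = 3) -> list[Experience]:
        scored = [(len(set(task.split()) & set(e.task.split())), e) for e in self.entries]
        return [e for score, e in sorted(scored, key=lambda s: -s[0])[:k] if score > 0]


def closed_loop_episode(task, vlm, robot, memory, max_attempts=3):
    """Plan with retrieved experience, execute, verify the outcome, reflect on failure."""
    context = memory.retrieve(task)          # pull relevant past experiences
    reflection = ""
    for _ in range(max_attempts):
        plan = vlm.plan(task, context, reflection)   # VLM proposes an action plan
        outcome = robot.execute(plan)                # act on the physical robot
        if vlm.verify(task, outcome):                # success: summarize and store
            memory.entries.append(Experience(task, plan, outcome, reflection))
            return True
        reflection = vlm.reflect(task, plan, outcome)  # failure: reflect, then retry
    memory.entries.append(Experience(task, plan, outcome, reflection))
    return False
```

In this sketch, the reflection string feeds back into the next planning call within an episode (short-term adaptation), while successful and failed episodes alike are summarized into LongTermMemory so later tasks can retrieve them, mirroring the two timescales of grounding the abstract describes.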