

Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory

July 22, 2025
Authors: Guowei Lan, Kaixian Qu, René Zurbrügg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter
cs.AI

Abstract

Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.
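
The abstract describes a closed loop in which the VLM plans, the robot executes, outcomes are verified, failures are reflected on, and the resulting experiences are stored for RAG-style retrieval in later tasks. Below is a minimal, illustrative Python sketch of such a plan–verify–reflect loop with a naive keyword-overlap retriever standing in for RAG; all class names, function names, and the retrieval scheme are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Experience:
    """One self-generated experience: task, plan, outcome, and reflection."""
    task: str
    plan: list[str]
    succeeded: bool
    reflection: str = ""

@dataclass
class LongTermMemory:
    """Long-term memory with keyword-overlap retrieval (a stand-in for RAG)."""
    entries: list[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.entries.append(exp)

    def retrieve(self, task: str, k: int = 3) -> list[Experience]:
        # Score stored experiences by word overlap with the new task description.
        words = set(task.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(words & set(e.task.lower().split())),
            reverse=True,
        )
        return scored[:k]

def run_task(task, memory, vlm_plan, execute_and_verify, reflect, max_attempts=3):
    """Closed loop: plan with retrieved experience, execute, verify, reflect, retry.

    `vlm_plan`, `execute_and_verify`, and `reflect` are caller-supplied hooks
    (e.g., VLM queries and robot skills); they are hypothetical placeholders here.
    """
    hints = memory.retrieve(task)          # prior experiences guide the first plan
    reflection = ""
    for _ in range(max_attempts):
        plan = vlm_plan(task, hints, reflection)   # VLM proposes an action sequence
        succeeded = execute_and_verify(plan)       # robot executes; outcome is verified
        exp = Experience(task, plan, succeeded, reflection)
        memory.add(exp)                            # summarize the attempt into memory
        if succeeded:
            return True
        reflection = reflect(task, plan)           # VLM reflects on the failure
    return False
```

In this sketch, failed attempts are stored together with their reflections, so a later call to `run_task` on a similar task retrieves them as hints and can avoid repeating the same mistake in its first attempt, mirroring the single-trial improvement the abstract reports.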