Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory
July 22, 2025
Authors: Guowei Lan, Kaixian Qu, René Zurbrügg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter
cs.AI
Abstract
Vision-language models (VLMs) have been widely adopted in robotics to enable
autonomous planning. However, grounding VLMs, originally trained on internet
data, to diverse real-world robots remains a challenge. This paper presents
ExpTeach, a framework that grounds VLMs to physical robots by building a
self-generated memory of real-world experiences. In ExpTeach, the VLM
autonomously plans actions, verifies outcomes, reflects on failures, and adapts
robot behaviors in a closed loop. The self-generated experiences during this
process are then summarized into a long-term memory, enabling retrieval of
learned knowledge to guide future tasks via retrieval-augmented generation
(RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with
an on-demand image annotation module. In experiments, we show that reflection
improves success rates from 36% to 84% on four challenging robotic tasks and
observe the emergence of intelligent object interactions, including creative
tool use. Across extensive tests on 12 real-world scenarios (including eight
unseen ones), we find that grounding with long-term memory boosts single-trial
success rates from 22% to 80%, demonstrating the effectiveness and
generalizability of ExpTeach.
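
To make the closed loop described in the abstract more concrete, the minimal Python sketch below mimics the plan → execute → verify → reflect cycle with a retrieval-augmented long-term memory. Every name here (Experience, LongTermMemory, closed_loop_episode, and the vlm/robot interfaces) is an illustrative assumption rather than the paper's implementation, and the keyword-overlap retrieval merely stands in for the RAG component used in ExpTeach.

```python
# Illustrative sketch only: interfaces and names are hypothetical, not ExpTeach's code.
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One summarized real-world episode stored in long-term memory."""
    task: str
    plan: str
    outcome: str
    reflection: str = ""


@dataclass
class LongTermMemory:
    """Stores summarized experiences; naive keyword overlap stands in for RAG retrieval."""
    entries: list[Experience] = field(default_factory=list)

    def retrieve(self, task: str, k: int = 3) -> list[Experience]:
        scored = [(len(set(task.split()) & set(e.task.split())), e) for e in self.entries]
        return [e for score, e in sorted(scored, key=lambda s: -s[0])[:k] if score > 0]


def closed_loop_episode(task, vlm, robot, memory, max_attempts=3):
    """Plan with retrieved experience, execute, verify the outcome, reflect on failure."""
    context = memory.retrieve(task)          # pull relevant past experiences
    reflection = ""
    for _ in range(max_attempts):
        plan = vlm.plan(task, context, reflection)   # VLM proposes an action plan
        outcome = robot.execute(plan)                # act on the physical robot
        if vlm.verify(task, outcome):                # success: summarize and store
            memory.entries.append(Experience(task, plan, outcome, reflection))
            return True
        reflection = vlm.reflect(task, plan, outcome)  # failure: reflect, then retry
    memory.entries.append(Experience(task, plan, outcome, reflection))
    return False
```

In this sketch, the reflection string feeds back into the next planning call within an episode (short-term adaptation), while successful and failed episodes alike are summarized into LongTermMemory so later tasks can retrieve them, mirroring the two timescales of grounding the abstract describes.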