
Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory

July 22, 2025
Authors: Guowei Lan, Kaixian Qu, René Zurbrügg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter
cs.AI

Abstract

Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.