Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory
July 22, 2025
Authors: Guowei Lan, Kaixian Qu, René Zurbrügg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter
cs.AI
Abstract
Vision-language models (VLMs) have been widely adopted in robotics to enable
autonomous planning. However, grounding VLMs, originally trained on internet
data, to diverse real-world robots remains a challenge. This paper presents
ExpTeach, a framework that grounds VLMs to physical robots by building a
self-generated memory of real-world experiences. In ExpTeach, the VLM
autonomously plans actions, verifies outcomes, reflects on failures, and adapts
robot behaviors in a closed loop. The self-generated experiences during this
process are then summarized into a long-term memory, enabling retrieval of
learned knowledge to guide future tasks via retrieval-augmented generation
(RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with
an on-demand image annotation module. In experiments, we show that reflection
improves success rates from 36% to 84% on four challenging robotic tasks and
observe the emergence of intelligent object interactions, including creative
tool use. Across extensive tests on 12 real-world scenarios (including eight
unseen ones), we find that grounding with long-term memory boosts single-trial
success rates from 22% to 80%, demonstrating the effectiveness and
generalizability of ExpTeach.
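
The following is a minimal, hypothetical Python sketch (not from the paper) of the two mechanisms the abstract describes: the closed loop of planning, execution, verification, and reflection, and a long-term memory queried RAG-style by task similarity. The `vlm`, `execute`, and `embed` callables, the prompts, and the `Experience` record are illustrative assumptions; the authors' actual prompts, verification procedure, and on-demand image annotation module are not shown.

```python
"""Hypothetical sketch of a closed-loop plan/verify/reflect cycle with a
RAG-style long-term memory, loosely following the ExpTeach abstract.
All names and prompts below are assumptions, not the authors' implementation."""
from dataclasses import dataclass, field
from typing import Callable, List
import math


@dataclass
class Experience:
    task: str
    lesson: str              # summary of what worked or failed, written by the VLM
    embedding: List[float]   # embedding of the task description


@dataclass
class LongTermMemory:
    embed: Callable[[str], List[float]]          # assumed text-embedding function
    items: List[Experience] = field(default_factory=list)

    def add(self, task: str, lesson: str) -> None:
        self.items.append(Experience(task, lesson, self.embed(task)))

    def retrieve(self, task: str, k: int = 3) -> List[str]:
        """RAG step: return lessons from the k most similar past tasks."""
        q = self.embed(task)
        ranked = sorted(self.items, key=lambda e: -_cosine(q, e.embedding))
        return [e.lesson for e in ranked[:k]]


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)


def run_task(task: str,
             vlm: Callable[[str], str],       # text-in / text-out VLM wrapper (assumed)
             execute: Callable[[str], bool],  # runs a plan on the robot; True if verified
             memory: LongTermMemory,
             max_attempts: int = 3) -> bool:
    """Closed loop: retrieve past lessons, plan, execute, verify, reflect, retry."""
    context = "\n".join(memory.retrieve(task))   # ground the plan in past experience
    reflection = ""
    for _ in range(max_attempts):
        plan = vlm(f"Task: {task}\nPast lessons:\n{context}\n"
                   f"Previous reflection: {reflection}\nPlan the next actions.")
        if execute(plan):                        # outcome verification on the robot
            memory.add(task, vlm(f"Summarize what worked for: {task}\nPlan: {plan}"))
            return True
        reflection = vlm(f"The plan failed: {plan}\nExplain why and what to change.")
    memory.add(task, vlm(f"Summarize the failure lessons for: {task}\n{reflection}"))
    return False
```

In this sketch, each attempt's reflection feeds back into the next plan (the closed loop), and both successes and failures are distilled into lessons that later tasks retrieve by embedding similarity, mirroring the reported gains from reflection and long-term-memory grounding.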