Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory
July 22, 2025
Authors: Guowei Lan, Kaixian Qu, René Zurbrügg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter
cs.AI
Abstract
Vision-language models (VLMs) have been widely adopted in robotics to enable
autonomous planning. However, grounding VLMs, originally trained on internet
data, to diverse real-world robots remains a challenge. This paper presents
ExpTeach, a framework that grounds VLMs to physical robots by building a
self-generated memory of real-world experiences. In ExpTeach, the VLM
autonomously plans actions, verifies outcomes, reflects on failures, and adapts
robot behaviors in a closed loop. The self-generated experiences during this
process are then summarized into a long-term memory, enabling retrieval of
learned knowledge to guide future tasks via retrieval-augmented generation
(RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with
an on-demand image annotation module. In experiments, we show that reflection
improves success rates from 36% to 84% on four challenging robotic tasks and
observe the emergence of intelligent object interactions, including creative
tool use. Across extensive tests on 12 real-world scenarios (including eight
unseen ones), we find that grounding with long-term memory boosts single-trial
success rates from 22% to 80%, demonstrating the effectiveness and
generalizability of ExpTeach.
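
The following is a minimal, hypothetical Python sketch (not from the paper) of the two mechanisms the abstract describes: the closed loop of planning, execution, verification, and reflection, and a long-term memory queried RAG-style by task similarity. The `vlm`, `execute`, and `embed` callables, the prompts, and the `Experience` record are illustrative assumptions; the authors' actual prompts, verification procedure, and on-demand image annotation module are not shown.

```python
"""Hypothetical sketch of a closed-loop plan/verify/reflect cycle with a
RAG-style long-term memory, loosely following the ExpTeach abstract.
All names and prompts below are assumptions, not the authors' implementation."""
from dataclasses import dataclass, field
from typing import Callable, List
import math


@dataclass
class Experience:
    task: str
    lesson: str              # summary of what worked or failed, written by the VLM
    embedding: List[float]   # embedding of the task description


@dataclass
class LongTermMemory:
    embed: Callable[[str], List[float]]          # assumed text-embedding function
    items: List[Experience] = field(default_factory=list)

    def add(self, task: str, lesson: str) -> None:
        self.items.append(Experience(task, lesson, self.embed(task)))

    def retrieve(self, task: str, k: int = 3) -> List[str]:
        """RAG step: return lessons from the k most similar past tasks."""
        q = self.embed(task)
        ranked = sorted(self.items, key=lambda e: -_cosine(q, e.embedding))
        return [e.lesson for e in ranked[:k]]


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)


def run_task(task: str,
             vlm: Callable[[str], str],       # text-in / text-out VLM wrapper (assumed)
             execute: Callable[[str], bool],  # runs a plan on the robot; True if verified
             memory: LongTermMemory,
             max_attempts: int = 3) -> bool:
    """Closed loop: retrieve past lessons, plan, execute, verify, reflect, retry."""
    context = "\n".join(memory.retrieve(task))   # ground the plan in past experience
    reflection = ""
    for _ in range(max_attempts):
        plan = vlm(f"Task: {task}\nPast lessons:\n{context}\n"
                   f"Previous reflection: {reflection}\nPlan the next actions.")
        if execute(plan):                        # outcome verification on the robot
            memory.add(task, vlm(f"Summarize what worked for: {task}\nPlan: {plan}"))
            return True
        reflection = vlm(f"The plan failed: {plan}\nExplain why and what to change.")
    memory.add(task, vlm(f"Summarize the failure lessons for: {task}\n{reflection}"))
    return False
```

In this sketch, each attempt's reflection feeds back into the next plan (the closed loop), and both successes and failures are distilled into lessons that later tasks retrieve by embedding similarity, mirroring the reported gains from reflection and long-term-memory grounding.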