Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory

Abstract

Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.
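
The closed loop described above (plan, verify, reflect, adapt, then summarize into a long-term memory that is queried for later tasks) can be illustrated with a small sketch. This is not the ExpTeach implementation: every name here (`Experience`, `LongTermMemory`, `run_task`, and the `toy_*` stand-ins for the VLM, the robot, and the verifier) is hypothetical, and retrieval uses a simple word-overlap score as a placeholder for whatever retrieval mechanism the framework actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Experience:
    task: str     # natural-language task description
    summary: str  # what was learned during the episode


@dataclass
class LongTermMemory:
    """Hypothetical long-term store with a toy retrieval step."""
    entries: List[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.entries.append(exp)

    def retrieve(self, task: str, k: int = 1) -> List[Experience]:
        # Assumption: Jaccard overlap of word sets stands in for a real
        # similarity measure (for example, learned embeddings).
        query = set(task.lower().split())

        def score(exp: Experience) -> float:
            words = set(exp.task.lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        ranked = sorted(self.entries, key=score, reverse=True)
        return [e for e in ranked[:k] if score(e) > 0]


def run_task(
    task: str,
    plan: Callable[[str, List[str], List[str]], str],
    execute: Callable[[str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    memory: LongTermMemory,
    max_attempts: int = 3,
) -> Tuple[bool, int]:
    """Plan, act, verify, and reflect until success or attempts run out."""
    hints = [e.summary for e in memory.retrieve(task)]  # retrieved knowledge
    reflections: List[str] = []                         # short-term memory
    for attempt in range(1, max_attempts + 1):
        action = plan(task, hints, reflections)
        outcome = execute(action)
        ok, feedback = verify(task, outcome)
        if ok:
            note = f"'{action}' worked"
            if reflections:
                note += " after: " + "; ".join(reflections)
            memory.add(Experience(task, note))  # summarize the episode
            return True, attempt
        reflections.append(f"'{action}' failed: {feedback}")
    memory.add(Experience(task, "unsolved; " + "; ".join(reflections)))
    return False, max_attempts


# Toy stand-ins (hypothetical) for the VLM planner, the robot, and the verifier.
def toy_plan(task: str, hints: List[str], reflections: List[str]) -> str:
    notes = " ".join(hints + reflections)
    # Switch to a tool once any note mentions an earlier grasp attempt.
    return "push with tool" if "grasp" in notes else "grasp"


def toy_execute(action: str) -> str:
    return "moved" if action == "push with tool" else "slipped"


def toy_verify(task: str, outcome: str) -> Tuple[bool, str]:
    return outcome == "moved", f"object {outcome}"


memory = LongTermMemory()
first = run_task("move the large box", toy_plan, toy_execute, toy_verify, memory)
second = run_task("move the small box", toy_plan, toy_execute, toy_verify, memory)
```

In this toy setting the first task needs one reflection step before it succeeds, while the second, similar task succeeds on its first attempt because the stored summary is retrieved as a hint. That mirrors, in miniature, the distinction the abstract draws between reflection within a task and grounding with long-term memory across tasks.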
