ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights
June 20, 2024
Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
cs.AI
Abstract
Large-scale generative language and vision-language models (LLMs and VLMs)
excel in few-shot in-context learning for decision making and instruction
following. However, they require high-quality exemplar demonstrations to be
included in their context window. In this work, we ask: Can LLMs and VLMs
generate their own prompt examples from generic, sub-optimal demonstrations? We
propose In-Context Abstraction Learning (ICAL), a method that builds a memory
of multimodal experience insights from sub-optimal demonstrations and human
feedback. Given a noisy demonstration in a new domain, VLMs abstract the
trajectory into a general program by fixing inefficient actions and annotating
cognitive abstractions: task relationships, object state changes, temporal
subgoals, and task construals. These abstractions are refined and adapted
interactively through human feedback while the agent attempts to execute the
trajectory in a similar environment. The resulting abstractions, when used as
exemplars in the prompt, significantly improve decision-making in
retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the
state-of-the-art in dialogue-based instruction following in TEACh, multimodal
web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we
achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our
task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action
forecasting, we improve over few-shot GPT-4V and remain competitive with
supervised models. We show finetuning our retrieval-augmented in-context agent
yields additional improvements. Our approach significantly reduces reliance on
expert-crafted examples and consistently outperforms in-context learning from
action plans that lack such insights.
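The abstract outlines a two-phase workflow: a VLM first abstracts a noisy demonstration into a general program annotated with cognitive abstractions (task relationships, object state changes, temporal subgoals, task construals), those abstractions are then refined through human feedback while the agent executes in a similar environment, and the resulting examples are stored in a memory that a retrieval-augmented agent draws on for in-context exemplars. The sketch below illustrates that loop under stated assumptions; the class names and the `vlm.abstract_trajectory`, `vlm.revise`, `vlm.plan`, and `env.execute` interfaces are hypothetical placeholders, not the paper's actual implementation, and the word-overlap retriever stands in for whatever retrieval mechanism ICAL uses.

```python
"""Minimal, hypothetical sketch of the ICAL loop described in the abstract.

All names here (Example, Memory, the vlm/env interfaces) are illustrative
placeholders, not the paper's API.
"""
from dataclasses import dataclass, field


@dataclass
class Example:
    """One stored experience: an abstracted trajectory plus its annotations."""
    task: str
    program: list[str]            # revised, generalized action sequence
    abstractions: dict[str, str]  # relationships, state changes, subgoals, construals


@dataclass
class Memory:
    examples: list[Example] = field(default_factory=list)

    def add(self, example: Example) -> None:
        self.examples.append(example)

    def retrieve(self, task: str, k: int = 3) -> list[Example]:
        # Placeholder retrieval: rank stored examples by naive word overlap
        # with the new task description (a real system would likely use an
        # embedding-based retriever).
        def overlap(ex: Example) -> int:
            return len(set(ex.task.lower().split()) & set(task.lower().split()))
        return sorted(self.examples, key=overlap, reverse=True)[:k]


def build_ical_memory(vlm, env, demos, memory: Memory, max_revisions: int = 3) -> Memory:
    """Phase 1: abstract each noisy demo; Phase 2: refine it with human feedback."""
    for demo in demos:
        # Phase 1: the VLM rewrites the trajectory into a general program,
        # fixing inefficient actions and annotating cognitive abstractions.
        example = vlm.abstract_trajectory(demo)

        # Phase 2: execute in a similar environment; on failure, a human
        # provides feedback and the VLM revises the abstractions.
        for _ in range(max_revisions):
            success, human_feedback = env.execute(example.program)
            if success:
                break
            example = vlm.revise(example, human_feedback)

        memory.add(example)
    return memory


def act(vlm, memory: Memory, task: str, observation):
    """Inference: prompt the agent with retrieved abstracted examples as exemplars."""
    exemplars = memory.retrieve(task)
    return vlm.plan(task=task, observation=observation, exemplars=exemplars)
```

In this reading, the stored abstractions (not raw trajectories) are what get placed in the prompt at inference time, which is the distinction the abstract draws against in-context learning from plain action plans.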