ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

June 20, 2024
作者: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
cs.AI

Abstract

Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback. Given a noisy demonstration in a new domain, VLMs abstract the trajectory into a general program by fixing inefficient actions and annotating cognitive abstractions: task relationships, object state changes, temporal subgoals, and task construals. These abstractions are refined and adapted interactively through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting abstractions, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on expert-crafted examples and consistently outperforms in-context learning from action plans that lack such insights.
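The abstract describes a loop in which a VLM abstracts a noisy demonstration into a general program with annotated insights, stores it in a memory, and later retrieves stored exemplars to prompt the agent on new tasks. The following is a minimal, illustrative sketch of that loop, not the authors' implementation: the names (`ICALMemory`, `abstract_demonstration`, `act`), the `vlm` callable, the prompt wording, and the word-overlap retrieval are placeholder assumptions, and the interactive human-feedback refinement step is omitted.

```python
# Illustrative sketch of the abstraction -> memory -> retrieval loop described in the
# ICAL abstract. Not the authors' code; `vlm` is assumed to be any callable mapping a
# text prompt to a text completion.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Example:
    task: str            # natural-language task description
    program: str         # abstracted action program produced by the VLM
    insights: List[str]  # annotated abstractions (subgoals, state changes, construals)


@dataclass
class ICALMemory:
    examples: List[Example] = field(default_factory=list)

    def add(self, example: Example) -> None:
        self.examples.append(example)

    def retrieve(self, task: str, k: int = 3) -> List[Example]:
        # Toy retrieval: rank stored examples by word overlap with the new task.
        # The paper uses retrieval-augmented prompting; this scoring is a stand-in.
        def overlap(e: Example) -> int:
            return len(set(task.lower().split()) & set(e.task.lower().split()))
        return sorted(self.examples, key=overlap, reverse=True)[:k]


def abstract_demonstration(vlm: Callable[[str], str], task: str, trajectory: str) -> Example:
    """Ask the VLM to turn a noisy demonstration into a general program plus insights."""
    prompt = (
        f"Task: {task}\nNoisy demonstration:\n{trajectory}\n"
        "Rewrite this as a general program, fixing inefficient actions, then list "
        "task relationships, object state changes, temporal subgoals, and task "
        "construals after the marker INSIGHTS:."
    )
    response = vlm(prompt)
    program, _, insight_text = response.partition("INSIGHTS:")
    return Example(
        task=task,
        program=program.strip(),
        insights=[line.strip() for line in insight_text.splitlines() if line.strip()],
    )


def act(vlm: Callable[[str], str], memory: ICALMemory, task: str, observation: str) -> str:
    """Prompt the VLM with retrieved ICAL exemplars to choose the next action."""
    exemplars = "\n\n".join(
        f"Task: {e.task}\nProgram: {e.program}\nInsights: {'; '.join(e.insights)}"
        for e in memory.retrieve(task)
    )
    return vlm(f"{exemplars}\n\nTask: {task}\nObservation: {observation}\nNext action:")
```

In use, an agent would call `abstract_demonstration` on each sub-optimal demonstration, add the result to the memory, and then call `act` at decision time so that the retrieved abstractions serve as in-context exemplars, which is the mechanism the abstract credits for the reported gains.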
