Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks
May 1, 2025
Authors: Vishnu Sarukkai, Zhiqiang Xie, Kayvon Fatahalian
cs.AI
Abstract
Many methods for improving Large Language Model (LLM) agents for sequential
decision-making tasks depend on task-specific knowledge engineering--such as
prompt tuning, curated in-context examples, or customized observation and
action spaces. Using these approaches, agent performance improves with the
quality or amount of knowledge engineering invested. Instead, we investigate
how LLM agents can automatically improve their performance by learning
in-context from their own successful experiences on similar tasks. Rather than
relying on task-specific knowledge engineering, we focus on constructing and
refining a database of self-generated examples. We demonstrate that even a
naive accumulation of successful trajectories across training tasks boosts test
performance on three benchmarks: ALFWorld (73% to 89%), Wordcraft (55% to 64%),
and InterCode-SQL (75% to 79%)--matching the performance the initial agent
achieves if allowed two to three attempts per task. We then introduce two
extensions: (1) database-level selection through population-based training to
identify high-performing example collections, and (2) exemplar-level selection
that retains individual trajectories based on their empirical utility as
in-context examples. These extensions further enhance performance, achieving
91% on ALFWorld--matching more complex approaches that employ task-specific
components and prompts. Our results demonstrate that automatic trajectory
database construction offers a compelling alternative to labor-intensive
knowledge engineering.
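To make the core idea concrete, here is a minimal sketch of the "naive accumulation" baseline the abstract describes: successful trajectories from training tasks are stored in a database, and the most similar ones are retrieved as in-context examples for a new task. The class and method names (`TrajectoryDatabase`, `add_success`, `retrieve`) and the use of plain text similarity for retrieval are illustrative assumptions, not the paper's actual implementation.

```python
import difflib


class TrajectoryDatabase:
    """Hypothetical sketch: accumulate successful trajectories and
    retrieve the most similar ones as in-context examples."""

    def __init__(self, k: int = 3):
        # Each entry is (task_description, action_sequence).
        self.trajectories: list[tuple[str, list[str]]] = []
        self.k = k

    def add_success(self, task: str, actions: list[str]) -> None:
        # Naive accumulation: store every successful trajectory
        # encountered during training, with no filtering.
        self.trajectories.append((task, actions))

    def retrieve(self, task: str) -> list[tuple[str, list[str]]]:
        # Rank stored trajectories by textual similarity of their task
        # description to the new task, and return the top k to place
        # in the agent's prompt as in-context examples.
        return sorted(
            self.trajectories,
            key=lambda t: difflib.SequenceMatcher(None, t[0], task).ratio(),
            reverse=True,
        )[: self.k]


db = TrajectoryDatabase(k=1)
db.add_success(
    "put a mug in the microwave",
    ["take mug", "open microwave", "put mug in microwave"],
)
db.add_success(
    "clean a plate in the sink",
    ["take plate", "go to sink", "clean plate"],
)
examples = db.retrieve("put a cup in the microwave")
```

The paper's two extensions would slot in on top of this baseline: database-level selection would compare whole candidate databases by validation performance (via population-based training), and exemplar-level selection would drop individual stored trajectories whose measured utility as in-context examples is low.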