Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Abstract

Many methods for improving Large Language Model (LLM) agents on sequential decision-making tasks depend on task-specific knowledge engineering, such as prompt tuning, curated in-context examples, or customized observation and action spaces. With these approaches, agent performance improves with the quality or amount of knowledge engineering invested. Instead, we investigate how LLM agents can automatically improve their performance by learning in-context from their own successful experiences on similar tasks. Rather than relying on task-specific knowledge engineering, we focus on constructing and refining a database of self-generated examples. We demonstrate that even a naive accumulation of successful trajectories across training tasks boosts test performance on three benchmarks: ALFWorld (73% to 89%), Wordcraft (55% to 64%), and InterCode-SQL (75% to 79%). These gains match the performance the initial agent achieves if allowed two to three attempts per task. We then introduce two extensions: (1) database-level selection through population-based training to identify high-performing example collections, and (2) exemplar-level selection that retains individual trajectories based on their empirical utility as in-context examples. These extensions further enhance performance, achieving 91% on ALFWorld, which matches more complex approaches that employ task-specific components and prompts. Our results demonstrate that automatic trajectory database construction offers a compelling alternative to labor-intensive knowledge engineering.
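The naive accumulation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Trajectory`, `TrajectoryDB`, `token_overlap`, and `build_prompt` are hypothetical, and the word-overlap (Jaccard) retriever is an assumed stand-in, since the abstract does not say how similar tasks are matched or how prompts are laid out. The sketch only shows the general idea of storing successful trajectories and reusing the most similar ones as in-context examples.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    """One episode: the task text, its transcript, and whether it succeeded."""
    task: str
    steps: List[str]
    success: bool


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (assumed stand-in retriever)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


@dataclass
class TrajectoryDB:
    """Hypothetical database that keeps only successful trajectories."""
    examples: List[Trajectory] = field(default_factory=list)

    def add_if_successful(self, traj: Trajectory) -> bool:
        # Naive accumulation: store every success, discard every failure.
        if traj.success:
            self.examples.append(traj)
            return True
        return False

    def retrieve(self, task: str, k: int = 2) -> List[Trajectory]:
        # Rank stored trajectories by similarity of their task text to the new task.
        ranked = sorted(
            self.examples,
            key=lambda t: token_overlap(task, t.task),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(task: str, db: TrajectoryDB, k: int = 2) -> str:
    """Prepend the k most similar stored trajectories to the new task."""
    parts = [f"Task: {ex.task}\n" + "\n".join(ex.steps) for ex in db.retrieve(task, k)]
    parts.append(f"Task: {task}")
    return "\n\n".join(parts)


db = TrajectoryDB()
db.add_if_successful(Trajectory(
    "put a clean mug on the shelf",
    ["> take mug", "> clean mug", "> put mug on shelf"],
    True,
))
db.add_if_successful(Trajectory("heat an egg", ["> look"], False))  # failure, not stored
db.add_if_successful(Trajectory(
    "cool a tomato and put it on the table",
    ["> take tomato"],
    True,
))
prompt = build_prompt("put a clean plate on the shelf", db, k=1)
```

In this toy run the mug trajectory is retrieved for the plate task because its task text shares the most words with the query. The two extensions named in the abstract (database-level and exemplar-level selection) would operate on top of such a store and are not modeled here.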
