Exploring Expert Failures Improves LLM Agent Tuning

Abstract

Large Language Models (LLMs) have shown tremendous potential as agents, excelling at tasks that require multiple rounds of reasoning and interaction. Rejection Sampling Fine-Tuning (RFT) has emerged as an effective method for fine-tuning LLMs as agents: it first imitates expert-generated successful trajectories and then further improves agentic skills through iterative fine-tuning on successful, self-generated trajectories. However, since the expert (e.g., GPT-4) succeeds primarily on simpler subtasks and RFT inherently favors simpler scenarios, many complex subtasks remain unsolved and persistently out-of-distribution (OOD). Upon investigating these challenging subtasks, we discovered that previously failed expert trajectories can often provide valuable guidance, e.g., plans and key actions, that can significantly improve agent exploration efficiency and the acquisition of critical skills. Motivated by these observations, we propose Exploring Expert Failures (EEF), which identifies beneficial actions from failed expert trajectories and integrates them into the training dataset. Potentially harmful actions are carefully excluded to prevent contamination of the model learning process. By leveraging the beneficial actions in expert failures, EEF successfully solves some previously unsolvable subtasks and improves agent tuning performance. Remarkably, our approach achieved a 62% win rate in WebShop, outperforming RFT (53.6%) and GPT-4 (35.6%), and, to the best of our knowledge, set a new state of the art as the first method to surpass a score of 0.81 in WebShop and exceed 81 in SciWorld.
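The data-selection step that the abstract attributes to RFT can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Trajectory` type, the `rollout` callback, and the reward threshold of 1.0 for "success" are all hypothetical names and assumptions introduced here, and the actual fine-tuning step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one agent episode (not from the paper)."""
    task_id: str
    actions: Tuple[str, ...]  # sequence of agent actions
    reward: float             # final task reward, assumed to lie in [0, 1]


def rejection_sample(trajectories: List[Trajectory],
                     threshold: float = 1.0) -> List[Trajectory]:
    """Keep only trajectories whose final reward reaches the threshold."""
    return [t for t in trajectories if t.reward >= threshold]


def rft_round(dataset: List[Trajectory],
              rollout: Callable[[str], Trajectory],
              task_ids: List[str],
              threshold: float = 1.0) -> List[Trajectory]:
    """One data-collection round: roll out, filter, merge without duplicates."""
    new = rejection_sample([rollout(t) for t in task_ids], threshold)
    seen = set(dataset)
    return dataset + [t for t in new if t not in seen]
```

Under this sketch, a task on which neither the expert nor the agent ever reaches the threshold contributes nothing to the training set, which is the limitation the abstract describes for complex subtasks.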
