Exploring Expert Failures Improves LLM Agent Tuning

Abstract

Large Language Models (LLMs) have shown tremendous potential as agents, excelling at tasks that require multiple rounds of reasoning and interaction. Rejection Sampling Fine-Tuning (RFT) has emerged as an effective method for fine-tuning LLMs as agents: it first imitates expert-generated successful trajectories and then further improves agentic skills through iterative fine-tuning on successful, self-generated trajectories. However, because the expert (e.g., GPT-4) succeeds primarily on simpler subtasks and RFT inherently favors simpler scenarios, many complex subtasks remain unsolved and persistently out-of-distribution (OOD). Upon investigating these challenging subtasks, we found that previously failed expert trajectories can often provide valuable guidance, e.g., plans and key actions, that can significantly improve agent exploration efficiency and the acquisition of critical skills. Motivated by these observations, we propose Exploring Expert Failures (EEF), which identifies beneficial actions in failed expert trajectories and integrates them into the training dataset. Potentially harmful actions are carefully excluded to prevent contamination of the model learning process. By leveraging the beneficial actions in expert failures, EEF solves some previously unsolvable subtasks and improves agent tuning performance. Remarkably, our approach achieves a 62% win rate in WebShop, outperforming RFT (53.6%) and GPT-4 (35.6%), and, to the best of our knowledge, sets a new state of the art as the first method to surpass a score of 0.81 in WebShop and exceed 81 in SciWorld.
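The RFT procedure summarized above (imitate successful expert trajectories, then repeatedly fine-tune on successful self-generated ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` type, the `finetune` and `rollout` callables, and the success threshold of 1.0 are all hypothetical placeholders chosen for illustration. The sketch only shows the filtering step that makes RFT favor tasks the agent can already solve.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one episode (names are assumptions)."""
    task_id: str
    actions: Tuple[str, ...]
    reward: float  # assumed to lie in [0, 1], with 1.0 meaning success


def keep_successful(trajs: Sequence[Trajectory],
                    threshold: float = 1.0) -> List[Trajectory]:
    """Rejection step: discard every trajectory below the success threshold."""
    return [t for t in trajs if t.reward >= threshold]


def rft_loop(tasks: Sequence[str],
             expert_trajs: Sequence[Trajectory],
             finetune: Callable[[List[Trajectory]], Any],
             rollout: Callable[[Any, str], Trajectory],
             rounds: int = 2,
             threshold: float = 1.0) -> Tuple[Any, List[Trajectory]]:
    """Sketch of iterative rejection-sampling fine-tuning.

    `finetune` maps a dataset to a policy and `rollout` runs a policy on one
    task; both are stand-ins for real training and environment code.
    """
    # Stage 1: imitate only the successful expert trajectories.
    dataset = keep_successful(expert_trajs, threshold)
    policy = finetune(dataset)
    # Stage 2: sample from the current policy, keep successes, retrain.
    for _ in range(rounds):
        sampled = [rollout(policy, task) for task in tasks]
        dataset = dataset + keep_successful(sampled, threshold)
        policy = finetune(dataset)
    return policy, dataset
```

In this sketch, a task on which both the expert and the agent fail never contributes a trajectory to `dataset`, which is one way to picture why hard subtasks can stay out-of-distribution under plain RFT. How EEF selects beneficial actions from failed expert trajectories is not specified in the abstract and is deliberately left out here.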
