SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
March 19, 2025
Authors: Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li
cs.AI
Abstract
Large language model (LLM) agents need to perform multi-turn interactions in
real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM
agents fail to perform effective credit assignment over multiple turns while
leveraging the generalization capabilities of LLMs, and it remains unclear how
to develop such algorithms. To study this, we first introduce a new benchmark,
ColBench, where an LLM agent interacts with a human collaborator over multiple
turns to solve realistic tasks in backend programming and frontend design.
Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with
Step-WisE Evaluation from Training-time information), that uses a carefully
designed optimization objective to train a critic model with access to
additional training-time information. The critic provides step-level rewards
for improving the policy model. Our experiments demonstrate that SWEET-RL
achieves a 6% absolute improvement in success and win rates on ColBench
compared to other state-of-the-art multi-turn RL algorithms, enabling
Llama-3.1-8B to match or exceed the performance of GPT-4o in realistic
collaborative content creation.
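The abstract describes the method only at a high level: a critic with access to extra training-time information supplies step-level rewards that guide policy improvement. The snippet below is a minimal, hypothetical sketch of how such step-level critic rewards could weight a per-turn policy update; the function and variable names (step_level_policy_loss, turn_logprobs, critic_step_rewards) are illustrative assumptions, not the paper's actual objective or implementation.

```python
import torch


def step_level_policy_loss(turn_logprobs: torch.Tensor,
                           critic_step_rewards: torch.Tensor) -> torch.Tensor:
    """Sketch: advantage-weighted log-likelihood over the turns of one episode.

    turn_logprobs:        (num_turns,) summed log-probs of the agent's tokens per turn
    critic_step_rewards:  (num_turns,) step-level rewards produced by a trained critic
    """
    # Treat critic outputs as advantages and normalize them for stability.
    # This normalization is a common heuristic, not necessarily what SWEET-RL uses.
    adv = critic_step_rewards - critic_step_rewards.mean()
    adv = adv / (critic_step_rewards.std(unbiased=False) + 1e-8)

    # REINFORCE-style objective: maximize the advantage-weighted log-likelihood,
    # so turns the critic scores highly are reinforced more strongly.
    return -(adv.detach() * turn_logprobs).mean()


if __name__ == "__main__":
    # Toy usage: three turns with dummy log-probs and critic scores.
    logprobs = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
    rewards = torch.tensor([0.2, 0.9, 0.1])
    loss = step_level_policy_loss(logprobs, rewards)
    loss.backward()
    print(loss.item())
```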