SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
March 19, 2025
Authors: Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li
cs.AI
Abstract
Large language model (LLM) agents need to perform multi-turn interactions in
real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM
agents fail to perform effective credit assignment over multiple turns while
leveraging the generalization capabilities of LLMs, and it remains unclear how
to develop such algorithms. To study this, we first introduce a new benchmark,
ColBench, where an LLM agent interacts with a human collaborator over multiple
turns to solve realistic tasks in backend programming and frontend design.
Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with
Step-WisE Evaluation from Training-time information), that uses a carefully
designed optimization objective to train a critic model with access to
additional training-time information. The critic provides step-level rewards
for improving the policy model. Our experiments demonstrate that SWEET-RL
achieves a 6% absolute improvement in success and win rates on ColBench
compared to other state-of-the-art multi-turn RL algorithms, enabling
Llama-3.1-8B to match or exceed the performance of GPT-4o in realistic
collaborative content creation.
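The abstract describes the method only at a high level: a critic with access to extra training-time information supplies step-level rewards that guide policy improvement. The snippet below is a minimal, hypothetical sketch of how such step-level critic rewards could weight a per-turn policy update; the function and variable names (step_level_policy_loss, turn_logprobs, critic_step_rewards) are illustrative assumptions, not the paper's actual objective or implementation.

```python
import torch


def step_level_policy_loss(turn_logprobs: torch.Tensor,
                           critic_step_rewards: torch.Tensor) -> torch.Tensor:
    """Sketch: advantage-weighted log-likelihood over the turns of one episode.

    turn_logprobs:        (num_turns,) summed log-probs of the agent's tokens per turn
    critic_step_rewards:  (num_turns,) step-level rewards produced by a trained critic
    """
    # Treat critic outputs as advantages and normalize them for stability.
    # This normalization is a common heuristic, not necessarily what SWEET-RL uses.
    adv = critic_step_rewards - critic_step_rewards.mean()
    adv = adv / (critic_step_rewards.std(unbiased=False) + 1e-8)

    # REINFORCE-style objective: maximize the advantage-weighted log-likelihood,
    # so turns the critic scores highly are reinforced more strongly.
    return -(adv.detach() * turn_logprobs).mean()


if __name__ == "__main__":
    # Toy usage: three turns with dummy log-probs and critic scores.
    logprobs = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
    rewards = torch.tensor([0.2, 0.9, 0.1])
    loss = step_level_policy_loss(logprobs, rewards)
    loss.backward()
    print(loss.item())
```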