
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

March 19, 2025
Authors: Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li
cs.AI

Abstract

Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs, and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.
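To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how a step-wise critic with access to privileged training-time information could assign per-turn rewards in a multi-turn episode. All class and function names (StepwiseCritic, step_level_rewards) and the feature layout are illustrative assumptions, not details from the paper.

```python
# Minimal sketch, assuming the critic sees an embedding of
# [privileged training-time info, dialogue history, current action]
# for every turn, while the policy never sees the privileged part.
import torch
import torch.nn as nn


class StepwiseCritic(nn.Module):
    """Scores each (context, action) turn with a single scalar."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, turn_features: torch.Tensor) -> torch.Tensor:
        # turn_features: (num_turns, dim) -> (num_turns,) scalar scores
        return self.encoder(turn_features).squeeze(-1)


def step_level_rewards(critic: StepwiseCritic,
                       turn_features: torch.Tensor) -> torch.Tensor:
    """Credit each turn with the change in critic score it produces,
    giving a per-step reward signal for policy improvement."""
    with torch.no_grad():
        scores = critic(turn_features)                  # (num_turns,)
    prev = torch.cat([scores.new_zeros(1), scores[:-1]])
    return scores - prev                                # (num_turns,)


if __name__ == "__main__":
    critic = StepwiseCritic(dim=128)
    episode = torch.randn(5, 128)  # 5 turns of a collaboration episode
    print(step_level_rewards(critic, episode))
```

In this sketch the per-turn rewards could then be plugged into any policy-gradient or preference-based update; the paper's specific optimization objective for training the critic and policy is not reproduced here.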

