SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Abstract

Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn reinforcement learning (RL) algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs, and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT-4o in realistic collaborative content creation.
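
To make the idea of step-level rewards from a critic with training-time information concrete, the following is a minimal toy sketch, not the SWEET-RL objective itself (which this abstract does not specify). It assumes a hypothetical critic, `toy_critic`, that can see a hidden reference solution the policy never observes, and it centers the critic's scores over candidate actions at a single turn to obtain per-turn advantages. All function names and the token-overlap scoring rule are illustrative assumptions.

```python
from statistics import mean
from typing import List, Tuple


def toy_critic(action: str, hidden_reference: str) -> float:
    """Hypothetical critic: fraction of reference tokens covered by the action.

    The reference stands in for training-time information that is
    available to the critic but not to the policy (an assumption here).
    """
    ref_tokens = set(hidden_reference.split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & set(action.split())) / len(ref_tokens)


def step_advantages(scores: List[float]) -> List[float]:
    """Center the critic scores of one turn's candidates around their mean."""
    baseline = mean(scores)
    return [s - baseline for s in scores]


def rank_turn(
    candidates: List[str], hidden_reference: str
) -> Tuple[str, str, List[float]]:
    """Return the best and worst candidate for a single turn, plus advantages."""
    scores = [toy_critic(c, hidden_reference) for c in candidates]
    advs = step_advantages(scores)
    best = max(range(len(candidates)), key=advs.__getitem__)
    worst = min(range(len(candidates)), key=advs.__getitem__)
    return candidates[best], candidates[worst], advs


if __name__ == "__main__":
    reference = "def add(a, b): return a + b"
    turn_candidates = [
        "def add(a, b): return a + b",
        "what should the function return?",
    ]
    chosen, rejected, advantages = rank_turn(turn_candidates, reference)
    print("chosen:", chosen)
    print("rejected:", rejected)
    print("advantages:", advantages)
```

In a sketch like this, the per-turn (chosen, rejected) pair could serve, for example, as a preference signal for updating the policy at that turn, which is one way credit can be assigned to individual turns rather than to a whole trajectory.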
