SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Abstract

Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn reinforcement learning (RL) algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs, and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, in which an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), which uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT-4o in realistic collaborative content creation.
