Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment

Abstract

This paper investigates how Reinforcement Learning (RL) can enhance the reasoning capabilities of Large Language Model (LLM) agents. Specifically, we focus on multi-turn tool-use scenarios, which can be naturally modeled as Markov Decision Processes (MDPs). Existing approaches often train multi-turn LLM agents with trajectory-level advantage estimation in bandit settings, but they struggle with turn-level credit assignment across multiple decision steps, which limits their performance on multi-turn reasoning tasks. To address this, we introduce a fine-grained turn-level advantage estimation strategy that enables more precise credit assignment in multi-turn agent interactions. The strategy is general and can be incorporated into various RL algorithms such as Group Relative Policy Optimization (GRPO). Our experimental evaluation on multi-turn reasoning and search-based tool-use tasks with GRPO implementations highlights the effectiveness of the MDP framework and turn-level credit assignment in advancing the multi-turn reasoning capabilities of LLM agents in complex decision-making settings. Our method achieves 100% success in tool execution and 50% accuracy in exact answer matching, significantly outperforming baselines, which fail to invoke tools and achieve only 20-30% exact match accuracy.
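The contrast between trajectory-level and turn-level advantage estimation can be illustrated with a small sketch. It is not the paper's implementation: it assumes a two-turn rollout (a tool call followed by a final answer), a hypothetical per-turn reward for each, and a hypothetical `discount` parameter. The function names are illustrative only. Group normalization here follows the general GRPO idea of centering and scaling rewards within a group of rollouts sampled for the same prompt.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Center on the group mean and scale by the group standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def trajectory_level_advantages(rollouts):
    """Bandit-style estimate: both turns of a rollout share one advantage."""
    totals = [tool_r + answer_r for tool_r, answer_r in rollouts]
    return [(a, a) for a in group_normalize(totals)]


def turn_level_advantages(rollouts, discount=1.0):
    """Hypothetical turn-level estimate: each turn gets its own advantage.

    The last turn is credited only with the answer reward. The first turn is
    credited with its own reward plus the discounted advantage of what follows.
    """
    tool_adv = group_normalize([tool_r for tool_r, _ in rollouts])
    answer_adv = group_normalize([answer_r for _, answer_r in rollouts])
    return [(t + discount * a, a) for t, a in zip(tool_adv, answer_adv)]


# Four sampled rollouts for one prompt: (tool_reward, answer_reward).
# These reward values are made up for illustration.
group = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
traj = trajectory_level_advantages(group)
turn = turn_level_advantages(group)
```

Consider the second rollout, where the tool call succeeded but the final answer was wrong. Under the trajectory-level estimate both turns receive the same positive advantage, so the wrong answer is reinforced along with the good tool call. Under the turn-level sketch the tool-call turn receives a positive advantage and the answer turn a negative one, which is the kind of finer credit assignment the abstract describes.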

