Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment

May 17, 2025
作者: Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, Mingyi Hong
cs.AI

Abstract

This paper investigates approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL). Specifically, we focus on multi-turn tool-use scenarios, which can be naturally modeled as Markov Decision Processes (MDPs). While existing approaches often train multi-turn LLM agents with trajectory-level advantage estimation in bandit settings, they struggle with turn-level credit assignment across multiple decision steps, limiting their performance on multi-turn reasoning tasks. To address this, we introduce a fine-grained turn-level advantage estimation strategy to enable more precise credit assignment in multi-turn agent interactions. The strategy is general and can be incorporated into various RL algorithms such as Group Relative Policy Optimization (GRPO). Our experimental evaluation on multi-turn reasoning and search-based tool-use tasks with GRPO implementations highlights the effectiveness of the MDP framework and the turn-level credit assignment in advancing the multi-turn reasoning capabilities of LLM agents in complex decision-making settings. Our method achieves 100% success in tool execution and 50% accuracy in exact answer matching, significantly outperforming baselines, which fail to invoke tools and achieve only 20-30% exact match accuracy.
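The abstract contrasts trajectory-level advantage estimation (one scalar advantage shared by an entire rollout) with the paper's turn-level credit assignment. The sketch below is only a rough illustration of that distinction, not the authors' implementation: it assumes a group of sampled rollouts segmented into turns with per-turn scalar rewards, and all function and variable names are hypothetical.

```python
import numpy as np

def trajectory_level_advantages(trajectory_rewards):
    """Bandit-style baseline: one group-normalized advantage per sampled
    trajectory, shared by every turn (and token) in that trajectory."""
    r = np.asarray(trajectory_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def turn_level_advantages(turn_rewards):
    """Hypothetical turn-level variant: normalize each turn's return-to-go
    against the same turn index across the sampled group, so credit is
    assigned per decision step rather than per trajectory.

    turn_rewards: per-trajectory lists of per-turn rewards, assumed here
                  to have equal length across the group.
    """
    rewards = np.asarray(turn_rewards, dtype=float)           # (group, turns)
    # Return-to-go per turn: sum of rewards from this turn onward.
    returns = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
    mean = returns.mean(axis=0, keepdims=True)                # per-turn baseline
    std = returns.std(axis=0, keepdims=True) + 1e-8
    return (returns - mean) / std                             # (group, turns)

# Example: a group of 3 sampled rollouts, each with 2 turns
# (e.g., a tool call followed by a final answer).
group_rewards = [[0.0, 1.0],    # tool call succeeded, answer correct
                 [0.0, 0.0],    # tool call succeeded, answer wrong
                 [-1.0, 0.0]]   # tool call failed
print(trajectory_level_advantages([sum(r) for r in group_rewards]))
print(turn_level_advantages(group_rewards))
```

In the trajectory-level case the failed tool call and the wrong final answer receive the same (averaged) blame, whereas the turn-level variant separates the two decision steps, which is the kind of finer credit assignment the paper argues helps multi-turn tool use.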
