
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment

May 17, 2025
Authors: Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, Mingyi Hong
cs.AI

Abstract

This paper investigates approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL). Specifically, we focus on multi-turn tool-use scenarios, which can be naturally modeled as Markov Decision Processes (MDPs). Existing approaches often train multi-turn LLM agents with trajectory-level advantage estimation in bandit settings, but they struggle with turn-level credit assignment across multiple decision steps, which limits their performance on multi-turn reasoning tasks. To address this, we introduce a fine-grained turn-level advantage estimation strategy that enables more precise credit assignment in multi-turn agent interactions. The strategy is general and can be incorporated into various RL algorithms such as Group Relative Policy Optimization (GRPO). Our experimental evaluation on multi-turn reasoning and search-based tool-use tasks with GRPO implementations highlights the effectiveness of the MDP framework and of turn-level credit assignment in advancing the multi-turn reasoning capabilities of LLM agents in complex decision-making settings. Our method achieves 100% success in tool execution and 50% accuracy in exact answer matching, significantly outperforming baselines, which fail to invoke tools and achieve only 20-30% exact match accuracy.
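
The contrast between trajectory-level and turn-level advantage estimation can be sketched for a two-turn rollout (one tool-call turn followed by one answer turn). This is an illustrative sketch only, not the paper's implementation: the function names, the split into a per-turn reward and an outcome reward, the group standardization, and the `weight` parameter are all assumptions introduced here.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style baseline)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def trajectory_level_advantages(turn_rewards, outcome_rewards):
    """Bandit-style credit: every turn of a rollout shares one advantage."""
    totals = [t + o for t, o in zip(turn_rewards, outcome_rewards)]
    advantages = group_normalize(totals)
    # Returns (advantage for turn 1, advantage for turn 2) per rollout.
    return [(a, a) for a in advantages]


def turn_level_advantages(turn_rewards, outcome_rewards, weight=1.0):
    """Hypothetical turn-level credit: each turn gets its own advantage.

    Turn 1 (tool call) is credited with its own reward plus a weighted
    share of the final outcome; turn 2 (answer) only with the outcome.
    """
    adv_turn = group_normalize(turn_rewards)
    adv_outcome = group_normalize(outcome_rewards)
    return [(at + weight * ao, ao) for at, ao in zip(adv_turn, adv_outcome)]


# Hypothetical group of four rollouts for the same prompt.
# turn reward: 1 if the tool call executed; outcome reward: 1 if exact match.
turn_rewards = [1.0, 1.0, 0.0, 0.0]
outcome_rewards = [1.0, 0.0, 0.0, 0.0]

trajectory_adv = trajectory_level_advantages(turn_rewards, outcome_rewards)
turn_adv = turn_level_advantages(turn_rewards, outcome_rewards)
```

Under these assumptions, the second rollout (tool call succeeded, answer wrong) receives the same advantage on both turns in the trajectory-level variant, whereas the turn-level variant gives its tool-call turn a positive advantage and its answer turn a negative one.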
