Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment

Abstract

This paper investigates how Reinforcement Learning (RL) can enhance the reasoning capabilities of Large Language Model (LLM) agents. Specifically, we focus on multi-turn tool-use scenarios, which can be naturally modeled as Markov Decision Processes (MDPs). Existing approaches often train multi-turn LLM agents with trajectory-level advantage estimation in a bandit setting, so they struggle with turn-level credit assignment across multiple decision steps, which limits their performance on multi-turn reasoning tasks. To address this, we introduce a fine-grained turn-level advantage estimation strategy that enables more precise credit assignment in multi-turn agent interactions. The strategy is general and can be incorporated into various RL algorithms, such as Group Relative Policy Optimization (GRPO). Our experimental evaluation on multi-turn reasoning and search-based tool-use tasks with GRPO implementations highlights the effectiveness of the MDP framework and of turn-level credit assignment in advancing the multi-turn reasoning capabilities of LLM agents in complex decision-making settings. Our method achieves a 100% success rate in tool execution and 50% exact-match accuracy on answers, significantly outperforming baselines, which fail to invoke tools and achieve only 20-30% exact-match accuracy.
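To make the contrast between trajectory-level and turn-level advantage estimation concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a two-turn rollout (a tool-call turn followed by an answer turn), a scalar turn reward for the tool call, and a scalar outcome reward for the final answer. The function names, the `weight` parameter, and the additive combination of the two normalized rewards are all illustrative assumptions; only the group-wise normalization follows the usual GRPO recipe.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Center and scale rewards within one group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def trajectory_level_advantages(turn_rewards, outcome_rewards):
    """Bandit-style baseline: both turns of a rollout share one advantage."""
    totals = [t + o for t, o in zip(turn_rewards, outcome_rewards)]
    adv = group_normalize(totals)
    return [(a, a) for a in adv]  # (tool-call turn, answer turn)


def turn_level_advantages(turn_rewards, outcome_rewards, weight=1.0):
    """Hypothetical turn-level variant (an assumption, not the paper's formula).

    The tool-call turn is credited with its own reward plus the weighted
    outcome; the answer turn is credited with the outcome only.
    """
    adv_turn = group_normalize(turn_rewards)
    adv_outcome = group_normalize(outcome_rewards)
    return [(at + weight * ao, ao) for at, ao in zip(adv_turn, adv_outcome)]


if __name__ == "__main__":
    # Four rollouts for one prompt: 1 = tool call executed / answer correct.
    turn_rewards = [1, 1, 1, 0]
    outcome_rewards = [1, 0, 0, 0]
    print(trajectory_level_advantages(turn_rewards, outcome_rewards))
    print(turn_level_advantages(turn_rewards, outcome_rewards))
```

In this toy group, the trajectory-level estimate gives both turns of a rollout the same signal, so a wrong answer that follows a failed tool call is penalized more than a wrong answer that follows a successful one. The turn-level sketch instead assigns every wrong answer turn the same advantage and confines the tool-call signal to the tool-call turn.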

