LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

Abstract

The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and from the knowing-doing gap, the inability to act effectively on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigating these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as epsilon-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
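
To make the greediness failure mode and the epsilon-greedy remedy concrete, the following is a minimal sketch on a Bernoulli multi-armed bandit. It is not taken from the paper: the function name `run_bandit`, the arm means, and the tabular running-mean learner are illustrative assumptions, and the paper studies LLM agents rather than this simple learner.

```python
import random


def run_bandit(means, steps, epsilon, seed=0):
    """Play a Bernoulli bandit; return (total reward, pull counts per arm).

    Hypothetical helper for illustration only. `means` holds each arm's
    success probability; `epsilon` is the probability of a random pull.
    """
    rng = random.Random(seed)
    counts = [0] * len(means)
    values = [0.0] * len(means)  # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(len(means))
        else:
            # Exploit: pick the arm with the highest estimate
            # (ties go to the lowest index).
            arm = max(range(len(means)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total, counts


if __name__ == "__main__":
    arm_means = [0.2, 0.5, 0.8]  # assumed values; the best arm is the last
    for eps in (0.0, 0.1):
        total, counts = run_bandit(arm_means, steps=1000, epsilon=eps)
        print(f"epsilon={eps}: total reward={total}, pulls per arm={counts}")
```

With `epsilon=0.0` this learner only ever pulls the first arm, because ties are broken toward the lowest index and that arm's estimate never falls below the zero estimate of the untried arms. A small positive `epsilon` forces occasional pulls of the other arms, which is what lets the learner discover a better one.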
