Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

Abstract

Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of large language models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse: output diversity decreases during training, and the reduced diversity persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
