Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
September 26, 2025
Authors: Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, Wei Chen
cs.AI
Abstract
Recent reinforcement learning (RL) methods have substantially enhanced the
planning capabilities of Large Language Models (LLMs), yet the theoretical
basis for their effectiveness remains elusive. In this work, we investigate
RL's benefits and limitations through a tractable graph-based abstraction,
focusing on policy gradient (PG) and Q-learning methods. Our theoretical
analyses reveal that supervised fine-tuning (SFT) may introduce
co-occurrence-based spurious solutions, whereas RL achieves correct planning
primarily through exploration, underscoring exploration's role in enabling
better generalization. However, we also show that PG suffers from diversity
collapse: output diversity decreases during training, and this loss of
diversity persists even after perfect accuracy is attained. By contrast,
Q-learning provides two key
advantages: off-policy learning and diversity preservation at convergence. We
further demonstrate that careful reward design is necessary to prevent reward
hacking in Q-learning. Finally, applying our framework to the real-world
planning benchmark Blocksworld, we confirm that these behaviors manifest in
practice.
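
To make the contrast concrete, here is a minimal sketch of the kind of graph-based planning abstraction the abstract alludes to. The construction below (the GRAPH dictionary, rollout, q_explore, q_greedy, pg_policy, and distinct_successes) is my own toy illustration under assumed hyperparameters, not the paper's formulation or proofs: a tiny directed graph with two equally good routes to a goal, trained with tabular Q-learning and with baseline-free softmax policy gradient (REINFORCE), then probed for how many distinct successful plans each policy still samples.

```python
# Toy sketch (an assumption, not the paper's exact setup): a directed graph
# with two equally short routes from START to GOAL. Train (i) tabular
# Q-learning and (ii) softmax policy gradient (REINFORCE), then count how many
# distinct successful paths each policy still samples.
import math
import random
from collections import defaultdict

random.seed(0)

# Two disjoint correct routes: 0 -> 1 -> 3 and 0 -> 2 -> 3; node 3 is the goal.
GRAPH = {0: [1, 2], 1: [3], 2: [3], 3: []}
START, GOAL = 0, 3


def rollout(policy_fn, max_steps=5):
    """Sample a trajectory; reward is 1.0 iff the GOAL node is reached."""
    node, path = START, [START]
    for _ in range(max_steps):
        actions = GRAPH[node]
        if not actions:
            break
        node = policy_fn(node, actions)
        path.append(node)
        if node == GOAL:
            return path, 1.0
    return path, 0.0


# ---- (i) Tabular Q-learning with epsilon-greedy exploration ----
Q = defaultdict(float)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2


def q_explore(node, actions):
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(node, a)])


for _ in range(500):
    node = START
    while GRAPH[node]:
        a = q_explore(node, GRAPH[node])
        reward = 1.0 if a == GOAL else 0.0
        next_best = max((Q[(a, b)] for b in GRAPH[a]), default=0.0)
        Q[(node, a)] += ALPHA * (reward + GAMMA * next_best - Q[(node, a)])
        node = a


def q_greedy(node, actions, tol=1e-6):
    """Greedy w.r.t. Q with uniform tie-breaking: both optimal routes keep
    equal Q-values, so multiple correct plans survive at convergence."""
    best = max(Q[(node, a)] for a in actions)
    return random.choice([a for a in actions if Q[(node, a)] >= best - tol])


# ---- (ii) Softmax policy gradient (REINFORCE, no baseline) ----
theta = defaultdict(float)  # one logit per (node, action)


def pg_policy(node, actions):
    logits = [theta[(node, a)] for a in actions]
    m = max(logits)
    weights = [math.exp(lg - m) for lg in logits]
    total = sum(weights)
    return random.choices(actions, [w / total for w in weights])[0]


LR = 1.0
for _ in range(500):
    path, reward = rollout(pg_policy)
    # REINFORCE: push up the log-probability of the actions actually taken.
    # With all-positive rewards and no baseline, whichever route is sampled
    # more early on tends to become self-reinforcing (diversity collapse).
    for s, a in zip(path[:-1], path[1:]):
        actions = GRAPH[s]
        logits = [theta[(s, b)] for b in actions]
        m = max(logits)
        weights = [math.exp(lg - m) for lg in logits]
        total = sum(weights)
        for b, p in zip(actions, [w / total for w in weights]):
            theta[(s, b)] += LR * reward * ((1.0 if b == a else 0.0) - p)


def distinct_successes(policy_fn, n=200):
    """How many distinct goal-reaching paths does the policy still sample?"""
    found = set()
    for _ in range(n):
        path, reward = rollout(policy_fn)
        if reward == 1.0:
            found.add(tuple(path))
    return found


print("Q-learning paths:", sorted(distinct_successes(q_greedy)))
print("Policy-gradient paths:", sorted(distinct_successes(pg_policy)))
```

In this toy setting, the Q-values of both routes typically converge to the same optimum, so the greedy-with-tie-breaking policy keeps sampling both correct plans, while the baseline-free policy gradient tends to concentrate on a single route even though both are correct. This is only meant to illustrate the qualitative contrast the abstract describes, not to reproduce the paper's theoretical results or its Blocksworld experiments.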