
Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

September 26, 2025
Authors: Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, Wei Chen
cs.AI

Abstract

Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training, and this collapse persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
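As a rough illustration of the diversity-collapse claim (not the paper's construction), the minimal sketch below runs vanilla REINFORCE on a hypothetical one-step task with several equally correct actions; the action count, reward pattern, learning rate, and step count are all assumptions chosen for demonstration. Accuracy quickly approaches 1.0, while the entropy of the softmax policy typically keeps shrinking as probability mass concentrates on a single correct action.

```python
# Illustrative toy experiment (assumed setup, not taken from the paper):
# vanilla REINFORCE on a one-step task where three of six actions are
# equally correct. Once accuracy is near 1.0, sampling noise typically
# keeps concentrating the softmax policy on one correct action, so the
# policy entropy (a diversity proxy) continues to fall.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 6                                # hypothetical action set
reward = np.array([1., 1., 1., 0., 0., 0.])  # three equally correct "plans"
theta = np.zeros(n_actions)                  # softmax logits
lr = 0.5                                     # assumed learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(20001):
    pi = softmax(theta)
    a = rng.choice(n_actions, p=pi)
    # REINFORCE update: grad log pi(a) = onehot(a) - pi for a softmax policy
    grad = -pi.copy()
    grad[a] += 1.0
    theta += lr * reward[a] * grad
    if step % 5000 == 0:
        acc = pi[reward == 1].sum()             # prob. of emitting a correct plan
        ent = -(pi * np.log(pi + 1e-12)).sum()  # policy entropy over all actions
        print(f"step {step:6d}  accuracy {acc:.3f}  entropy {ent:.3f}")
```

By contrast, a value-based learner that samples uniformly among actions with (near-)maximal estimated values would keep probability spread across the equally correct plans, which is the diversity-preservation contrast the abstract draws for Q-learning.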