Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

Abstract

Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse: output diversity decreases during training, and the reduced diversity persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.

