Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States

Abstract

Reinforcement learning (RL) has become a standard paradigm for post-training and aligning large language models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in the pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in generative AI.
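
The contrast between "history-as-state" and a compact Markov state can be illustrated with a minimal sketch. It assumes a hypothetical switch-toggling puzzle that is not one of the paper's benchmarks; the names `markov_state` and `count_states` are illustrative and do not come from the paper. The sketch only counts how many distinct inputs a policy would have to distinguish under each formulation and does not reproduce the paper's method or its theoretical guarantees.

```python
from itertools import product

N_SWITCHES = 3  # hypothetical toy puzzle: each action toggles one switch


def markov_state(history, n=N_SWITCHES):
    """Fold an action history into the current board configuration."""
    board = [0] * n
    for action in history:
        board[action] ^= 1  # toggling is its own inverse
    return tuple(board)


def count_states(horizon, n=N_SWITCHES):
    """Count distinct histories and distinct Markov states up to a horizon."""
    histories = set()
    boards = set()
    for t in range(horizon + 1):
        for h in product(range(n), repeat=t):
            histories.add(h)                # history-as-state: every sequence is new
            boards.add(markov_state(h, n))  # Markov state: many sequences collapse
    return len(histories), len(boards)


if __name__ == "__main__":
    for horizon in (2, 4, 6):
        n_hist, n_markov = count_states(horizon)
        print(f"horizon={horizon}: histories={n_hist}, markov_states={n_markov}")
```

With three switches, the number of distinct boards can never exceed 2^3 = 8, whereas the number of distinct action histories grows geometrically with the horizon. This is the sense in which a history-based formulation is tied to an ever-expanding input.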
