Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States
March 20, 2026
Authors: Yurun Yuan, Tengyang Xie
cs.AI
Abstract
Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions.
We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
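The structural contrast the abstract draws — a raw action history that grows without bound versus a compact Markov state that summarizes it — can be illustrated with a toy construction (ours, not from the paper). Here a hypothetical reward depends only on how many times each action was taken, so a small count-based state is Markov even though the number of distinct histories grows exponentially with the horizon:

```python
# Toy illustration (not the paper's method): when the reward depends only on
# a compact summary of the trajectory, a Markov state collapses an
# exponentially large set of raw action histories.
from itertools import product

ACTIONS = ("a", "b")
HORIZON = 10


def markov_state(history):
    # Hypothetical compact state: per-action counts. Many distinct
    # histories map to the same state, which is all the agent needs
    # if the dynamics/reward depend only on these counts.
    return (history.count("a"), history.count("b"))


# Distinct raw histories vs. distinct Markov states at the horizon.
histories = list(product(ACTIONS, repeat=HORIZON))
states = {markov_state(h) for h in histories}

print(len(histories))  # 2**HORIZON = 1024 raw histories
print(len(states))     # only HORIZON + 1 = 11 compact states
```

The sketch only makes the sample-complexity intuition concrete: an RL learner conditioning on full histories must generalize over 1024 contexts, while one with access to the (estimated) Markov state faces 11.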