Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States
March 20, 2026
Authors: Yurun Yuan, Tengyang Xie
cs.AI
Abstract
Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions.
We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
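The contrast the abstract draws — a state that is the ever-growing action history versus a compact, fixed-size Markov state — can be illustrated with a minimal sketch. All names here are illustrative, not from the paper, and the `encode` function stands in for whatever learned state-estimation module the authors actually use:

```python
from dataclasses import dataclass

@dataclass
class HistoryState:
    """'History-as-state': the state is the full token sequence so far."""
    tokens: list  # grows with every action (each generated token)

    def step(self, token: str) -> "HistoryState":
        return HistoryState(self.tokens + [token])

@dataclass(frozen=True)
class MarkovState:
    """Compact Markov state: a fixed-size summary sufficient for prediction."""
    summary: tuple  # fixed dimensionality regardless of episode length

def encode(history: HistoryState, dim: int = 4) -> MarkovState:
    # Placeholder estimator: in the paper's setting this would be a learned
    # module producing an estimated Markov state; here we merely fold the
    # tokens into a fixed-size vector to show the shape of the idea.
    vec = [0] * dim
    for i, tok in enumerate(history.tokens):
        vec[i % dim] = (vec[i % dim] + hash(tok)) % 97
    return MarkovState(tuple(vec))

# The history-based state grows linearly with the number of actions...
s = HistoryState([])
for tok in ["think", "step", "answer"]:
    s = s.step(tok)
assert len(s.tokens) == 3

# ...while the estimated Markov state stays fixed-size no matter how
# long the episode runs.
assert len(encode(s).summary) == 4
```

The sample-complexity argument in the abstract hinges on exactly this difference: planning over `MarkovState`-like representations searches a bounded space, whereas planning over `HistoryState`-like representations must cope with a state space that expands with every token emitted.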