
The Markovian Thinker

October 8, 2025
Authors: Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy
cs.AI

Abstract

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
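
To make the Delethink rollout concrete, here is a minimal sketch of the chunked, constant-context loop the abstract describes. The 8K chunk size and 24K thinking budget come from the paper; everything else (the `generate` interface, the `carryover` length, the end-of-thought token) is an assumption for illustration, not the authors' implementation.

```python
from typing import Callable, List

def delethink_rollout(
    generate: Callable[[List[int], int], List[int]],  # assumed: (context, max_new_tokens) -> tokens
    prompt: List[int],
    chunk: int = 8192,        # fixed chunk size (paper: 8K-token chunks)
    carryover: int = 512,     # short carryover after each reset (size assumed)
    budget: int = 24576,      # total thinking budget (paper: up to 24K tokens)
    eot: int = -1,            # assumed end-of-thought token id
) -> List[int]:
    """Markovian rollout: the policy conditions on a constant-size state."""
    thought: List[int] = []        # full trace, kept only for reward/logging
    context = list(prompt)         # bounded state: prompt + at most one chunk
    while len(thought) < budget:
        span = generate(context, chunk)  # model thinks as usual within the chunk
        thought.extend(span)
        if eot in span:                  # reasoning finished inside this chunk
            break
        # Chunk boundary: reset the context and reinitialize the prompt with a
        # short carryover -- the textual state RL trains the policy to write
        # well enough for seamless continuation after the reset.
        context = list(prompt) + span[-carryover:]
    return thought
```

Because the context never exceeds the prompt plus one chunk, attention cost per generated token is bounded by a constant, so total compute grows linearly in thinking length instead of quadratically.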