

The Markovian Thinker

October 8, 2025
Authors: Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy
cs.AI

Abstract

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
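To make the chunked-reasoning mechanism concrete, here is a minimal sketch of a Delethink-style rollout loop. It is not the authors' implementation: the `generate` function is a placeholder for any LLM decoding call, the `</think>` stop marker, the carryover size, and the character-level (rather than token-level) slicing are all illustrative assumptions.

```python
# Sketch of Delethink-style chunked inference: the model reasons in
# fixed-size chunks, and at each chunk boundary the environment resets
# the context and re-seeds it with the original prompt plus a short
# carryover taken from the tail of the previous chunk.

CHUNK_SIZE = 8192   # tokens the model may emit per chunk (8K, as in the paper)
CARRYOVER = 512     # tail carried across the reset (size is an assumption)
MAX_CHUNKS = 3      # 3 x 8K = up to 24K tokens of total thinking

def generate(context: str, max_new_tokens: int) -> str:
    """Placeholder for any LLM decoding call that returns new text."""
    raise NotImplementedError

def delethink_rollout(prompt: str) -> str:
    thought = ""
    context = prompt
    for _ in range(MAX_CHUNKS):
        chunk = generate(context, max_new_tokens=CHUNK_SIZE)
        thought += chunk
        if "</think>" in chunk:  # model signals it has finished reasoning
            break
        # Chunk boundary: reset the context. Through RL, the policy learns
        # to write a textual state near the end of each chunk that is
        # sufficient to continue seamlessly from just this carryover.
        # (Sliced by characters here for simplicity; the paper's
        # environment operates on tokens.)
        context = prompt + chunk[-CARRYOVER:]
    return thought
```

Because the attention window never grows beyond the prompt plus one chunk and its carryover, per-token cost stays constant and total compute scales linearly in thinking length, rather than quadratically as when the full reasoning trace stays in context.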