The Markovian Thinker

Abstract

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet in the standard RL "thinking environment", the state is the prompt plus all prior reasoning tokens, so the state is unbounded and attention-based policies pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence, compute grows linearly with thinking length and memory stays constant. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the chunk boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write, near the end of each chunk, a textual state sufficient to continue reasoning seamlessly after the reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks for up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate that, at an average thinking length of 96K tokens, LongCoT-RL costs 27 H100-months versus 7 for Delethink. Analysis at RL initialization shows that off-the-shelf reasoning models (1.5B to 120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
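The chunk-and-reset loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the token representation, the toy counting policy, and in particular the choice to build the carryover from the last few tokens of the previous chunk are all assumptions made for illustration. The abstract only states that the context is reset at each boundary and the prompt is reinitialized with a short carryover.

```python
from typing import Callable, List, Tuple

Tokens = List[str]


def markovian_rollout(
    prompt: Tokens,
    generate: Callable[[Tokens, int], Tokens],  # hypothetical policy interface
    chunk_size: int,
    carryover_size: int,
    max_thinking: int,
    stop_token: str = "<eos>",
) -> Tuple[Tokens, int]:
    """Generate a long trace in fixed-size chunks, resetting the context each time.

    Returns the full trace and the largest context the policy ever saw.
    """
    trace: Tokens = []
    carryover: Tokens = []
    max_context = 0
    while len(trace) < max_thinking:
        # Reset: the policy sees only the prompt and a short carryover,
        # never the full history of earlier chunks.
        context = prompt + carryover
        budget = min(chunk_size, max_thinking - len(trace))
        chunk = generate(context, budget)
        if not chunk:
            break
        trace.extend(chunk)
        max_context = max(max_context, len(context) + len(chunk))
        if chunk[-1] == stop_token:
            break
        # Assumption: the carryover is the tail of the chunk just written.
        carryover = chunk[-carryover_size:] if carryover_size > 0 else []
    return trace, max_context


def counting_policy(context: Tokens, budget: int) -> Tokens:
    """Toy stand-in for a model: counts from 0 to 9, resuming from the last token it sees."""
    last = context[-1]
    start = int(last) + 1 if last.isdigit() else 0
    out: Tokens = []
    for n in range(start, start + budget):
        if n >= 10:
            out.append("<eos>")
            break
        out.append(str(n))
    return out


prompt = ["count", "to", "ten"]
trace, max_context = markovian_rollout(
    prompt, counting_policy, chunk_size=4, carryover_size=2, max_thinking=100
)
print(trace)
print(max_context)
```

In this toy run the trace spans three chunks, yet the context never exceeds `len(prompt) + carryover_size + chunk_size` tokens. The per-chunk cost is therefore bounded by a constant, which is where the linear total compute claimed in the abstract comes from under these assumptions.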
