Rethinking Thinking Tokens: LLMs as Improvement Operators

Abstract

Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which, among other things, allows them to explore solution strategies with self-checking. This yields higher accuracy but inflates context length, token/compute cost, and answer latency. We ask: can current models leverage their metacognition to reach other points on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts", with a continuum of possible strategies. We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (and hence compute cost) is controllable via the degree of parallelism and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting the degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR), which iteratively improves a single candidate answer and outperforms long CoT. The success of such model orchestrations raises the question of whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
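The three PDR steps named in the abstract can be pictured with a minimal sketch. It is illustrative only and not the authors' implementation: `generate` is a hypothetical stand-in for any model call, the prompt strings are invented, `distill` uses naive character truncation as a crude substitute for a real (possibly model-written) summary, and the drafts are produced in a plain loop rather than truly in parallel.

```python
from typing import Callable, List

# Hypothetical model interface (an assumption): prompt text in, generated text out.
Generate = Callable[[str], str]


def distill(drafts: List[str], max_chars: int) -> str:
    """Placeholder distillation: join the drafts and truncate to a fixed size.

    A real system would more likely summarize or select; truncation is used
    here only so that the workspace is visibly bounded.
    """
    return "\n---\n".join(drafts)[:max_chars]


def pdr(
    problem: str,
    generate: Generate,
    rounds: int = 2,
    parallelism: int = 4,
    workspace_chars: int = 2000,
) -> str:
    """Run `rounds` of draft -> distill -> refine and return the last output."""
    seed = ""  # output of the previous round; empty before the first round
    for _ in range(rounds):
        # Only a bounded slice of the previous output is carried forward.
        base = f"Problem: {problem}\nPrevious output: {seed[:workspace_chars]}\n"
        # (i) Diverse drafts. Conceptually parallel; sequential in this sketch.
        drafts = [generate(f"{base}Draft {i + 1}:") for i in range(parallelism)]
        # (ii) Distill the drafts into a bounded textual workspace.
        workspace = distill(drafts, workspace_chars)
        # (iii) Refine conditioned on the workspace; the result seeds the next round.
        seed = generate(
            f"Problem: {problem}\nWorkspace: {workspace}\nRefined answer:"
        )
    return seed
```

In this sketch each round issues `parallelism + 1` model calls, while the text carried between rounds stays within `workspace_chars`, which mirrors the abstract's point that context length is decoupled from the total number of generated tokens. Calling it with `parallelism=1` reduces each round to a single draft followed by a refinement, loosely in the spirit of Sequential Refinement.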
