Rethinking Thinking Tokens: LLMs as Improvement Operators
October 1, 2025
Authors: Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal
cs.AI
Abstract
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which, among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts", with a continuum of possible strategies. We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (and hence compute cost) is controllable via the degree of parallelism and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting the degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR), which iteratively improves a single candidate answer and provides performance superior to long CoT. The success of such model orchestrations raises the question of whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
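
To make the three PDR steps concrete, here is a minimal sketch of the loop as described in the abstract. The `llm` callable, prompt wording, word limit on the workspace, and round/parallelism settings are illustrative assumptions, not the authors' implementation; the paper's RL-trained 8B model and matched-budget evaluation are not reproduced here.

```python
def pdr(llm, problem, parallelism=4, rounds=3):
    """Sketch of Parallel-Distill-Refine (PDR).

    Each round: (i) draft in parallel, (ii) distill the drafts into a bounded
    textual workspace, (iii) refine a single answer that seeds the next round.
    `llm(prompt, temperature)` is assumed to return a text completion.
    """
    workspace = ""  # bounded summary carried across rounds, not the full drafts
    answer = ""

    for _ in range(rounds):
        # (i) Generate diverse drafts in parallel (independent samples).
        drafts = [
            llm(
                f"Problem: {problem}\n"
                f"Workspace: {workspace}\n"
                f"Current answer: {answer}\n"
                "Propose an improved solution.",
                temperature=1.0,
            )
            for _ in range(parallelism)
        ]

        # (ii) Distill the drafts into a short, bounded workspace, so context
        # length is controlled by parallelism rather than total tokens generated.
        workspace = llm(
            "Summarize the key ideas, partial results, and disagreements in "
            "these drafts in under 300 words:\n" + "\n---\n".join(drafts),
            temperature=0.0,
        )

        # (iii) Refine: produce one answer conditioned on the workspace;
        # it seeds the next round.
        answer = llm(
            f"Problem: {problem}\n"
            f"Distilled workspace: {workspace}\n"
            "Write the best complete solution.",
            temperature=0.0,
        )

    return answer

# Setting parallelism=1 recovers the Sequential Refinement (SR) subcase:
# iteratively improving a single candidate answer.
```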