Rethinking Thinking Tokens: LLMs as Improvement Operators
October 1, 2025
Authors: Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal
cs.AI
Abstract
Reasoning training incentivizes LLMs to produce long chains of thought (long
CoT), which, among other things, allows them to explore solution strategies with
self-checking. This results in higher accuracy, but inflates context length,
token/compute cost, and answer latency. We ask: Can current models leverage
their metacognition to provide other combinations on this Pareto frontier,
e.g., better accuracy with lower context length and/or latency? Abstractly, we
view the model as an improvement operator on its own "thoughts" with a
continuum of possible strategies. We identify an interesting inference family,
Parallel-Distill-Refine (PDR), which performs the following: (i) generate
diverse drafts in parallel; (ii) distill them into a bounded, textual
workspace; and (iii) refine conditioned on this workspace, producing an output
that seeds the next round. Importantly, context length (hence compute cost) is
controllable via degree of parallelism, and is no longer conflated with the
total number of generated tokens. We report PDR instantiations of current
models that give better accuracy than long CoT while incurring lower latency.
Setting the degree of parallelism to 1 yields an interesting subcase, Sequential
Refinement (SR), which iteratively improves a single candidate answer and
provides performance superior to long CoT. The success of such model
orchestrations raises the question of whether further training could shift the
Pareto frontier. To this
end, we train an 8B thinking model with Reinforcement Learning (RL) to make it
consistent with PDR as the inference method. On math tasks with verifiable
answers, iterative pipelines surpass single-pass baselines at matched
sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME
2024 and +9% on AIME 2025).
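
To make the PDR round structure concrete, below is a minimal Python sketch of the loop described in the abstract. The `llm` callable, the prompt wording, and the `parallelism`, `rounds`, and `workspace_limit` parameters are illustrative assumptions rather than the authors' implementation; setting `parallelism=1` recovers the Sequential Refinement (SR) subcase.

```python
# Minimal sketch of a Parallel-Distill-Refine (PDR) loop (illustrative, not the
# authors' implementation). `llm` is any callable mapping a prompt to a completion.
from typing import Callable, List

def pdr(
    llm: Callable[[str], str],
    problem: str,
    parallelism: int = 4,        # drafts per round; 1 reduces to Sequential Refinement (SR)
    rounds: int = 3,
    workspace_limit: int = 2000, # rough character bound on the distilled workspace (assumed)
) -> str:
    workspace = ""               # bounded textual summary carried between rounds
    answer = ""
    for _ in range(rounds):
        # (i) generate diverse drafts (run in parallel in practice; sequential here for clarity)
        drafts: List[str] = [
            llm(f"Problem:\n{problem}\n\nPrior workspace:\n{workspace}\n\n"
                f"Propose a solution draft.")
            for _ in range(parallelism)
        ]
        # (ii) distill the drafts into a bounded textual workspace
        joined = "\n\n---\n\n".join(drafts)
        workspace = llm(
            f"Summarize the key ideas, partial results, and errors in these drafts "
            f"in at most {workspace_limit} characters:\n{joined}"
        )[:workspace_limit]
        # (iii) refine conditioned on the workspace; the output seeds the next round
        answer = llm(
            f"Problem:\n{problem}\n\nDistilled workspace:\n{workspace}\n\n"
            f"Produce an improved, complete solution."
        )
    return answer
```

Because each call conditions only on the problem and the bounded workspace, per-call context stays roughly constant across rounds; this reflects the property highlighted above, namely that context length is controlled by the degree of parallelism and the workspace bound rather than by the total number of generated tokens.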