Value-Guided Search for Efficient Chain-of-Thought Reasoning
May 23, 2025
Authors: Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun
cs.AI
Abstract
In this paper, we propose a simple and efficient method for value model
training on long-context reasoning traces. Compared to existing process reward
models (PRMs), our method does not require a fine-grained notion of "step,"
which is difficult to define for long-context reasoning models. By collecting a
dataset of 2.5 million reasoning traces, we train a 1.5B token-level value
model and apply it to DeepSeek models for improved performance with test-time
compute scaling. We find that block-wise value-guided search (VGS) with a final
weighted majority vote achieves better test-time scaling than standard methods
such as majority voting or best-of-n. With an inference budget of 64
generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of
45.7% across four competition math benchmarks (AIME 2024 & 2025, HMMT Feb 2024
& 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly
reduces the inference FLOPs required to achieve the same performance as
majority voting. Our dataset, model, and codebase are open-sourced.
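
For readers who want a concrete picture of the procedure, below is a minimal, illustrative sketch of block-wise value-guided search with a final value-weighted majority vote, based only on the abstract's description. The helpers `generate_block`, `value_score`, `is_finished`, and `extract_answer`, along with the beam and branch parameters, are hypothetical placeholders rather than the authors' released API; see the open-sourced codebase for the actual implementation.

```python
from collections import defaultdict

# Hypothetical placeholders (not the authors' released API):
# generate_block samples the next block of tokens from the policy model,
# value_score runs the token-level value model on a partial trace, and
# is_finished / extract_answer detect and parse a final answer.
def generate_block(prefix: str) -> str: ...
def value_score(trace: str) -> float: ...
def is_finished(trace: str) -> bool: ...
def extract_answer(trace: str) -> str: ...

def value_guided_search(prompt: str, beam_width: int = 4,
                        branch: int = 4, max_blocks: int = 64) -> str:
    """Block-wise VGS sketch: expand each beam with several candidate
    blocks, rescore with the value model, keep the top `beam_width`
    partial traces, and aggregate finished traces with a value-weighted
    majority vote."""
    beams = [prompt] * beam_width
    finished = []  # (completed trace, value score)

    for _ in range(max_blocks):
        candidates = []
        for trace in beams:
            for _ in range(branch):
                extended = trace + generate_block(trace)
                candidates.append((extended, value_score(extended)))
        # Set completed traces aside; keep searching over open ones.
        open_traces = []
        for trace, v in candidates:
            (finished if is_finished(trace) else open_traces).append((trace, v))
        if not open_traces:
            break
        open_traces.sort(key=lambda tv: tv[1], reverse=True)
        beams = [trace for trace, _ in open_traces[:beam_width]]

    # Final weighted majority vote: each answer's weight is the sum of
    # the value scores of the traces that produced it.
    votes = defaultdict(float)
    for trace, v in finished:
        votes[extract_answer(trace)] += v
    return max(votes, key=votes.get)
```

Because candidates are scored at block boundaries rather than at annotated "steps," this kind of search needs no step segmentation of the trace, which is the property the paper contrasts against process reward models.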