Value-Guided Search for Efficient Chain-of-Thought Reasoning
May 23, 2025
Authors: Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun
cs.AI
Abstract
In this paper, we propose a simple and efficient method for value model
training on long-context reasoning traces. Compared to existing process reward
models (PRMs), our method does not require a fine-grained notion of "step,"
which is difficult to define for long-context reasoning models. By collecting a
dataset of 2.5 million reasoning traces, we train a 1.5B token-level value
model and apply it to DeepSeek models for improved performance with test-time
compute scaling. We find that block-wise value-guided search (VGS) with a final
weighted majority vote achieves better test-time scaling than standard methods
such as majority voting or best-of-n. With an inference budget of 64
generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of
45.7% across four competition math benchmarks (AIME 2024 & 2025, HMMT Feb 2024
& 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly
reduces the inference FLOPs required to achieve the same performance as
majority voting. Our dataset, model, and codebase are open-sourced.
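
For readers who want a concrete picture of the procedure, below is a minimal, illustrative sketch of block-wise value-guided search with a final value-weighted majority vote, based only on the abstract's description. The helpers `generate_block`, `value_score`, `is_finished`, and `extract_answer`, along with the beam and branch parameters, are hypothetical placeholders rather than the authors' released API; see the open-sourced codebase for the actual implementation.

```python
from collections import defaultdict

# Hypothetical placeholders (not the authors' released API):
# generate_block samples the next block of tokens from the policy model,
# value_score runs the token-level value model on a partial trace, and
# is_finished / extract_answer detect and parse a final answer.
def generate_block(prefix: str) -> str: ...
def value_score(trace: str) -> float: ...
def is_finished(trace: str) -> bool: ...
def extract_answer(trace: str) -> str: ...

def value_guided_search(prompt: str, beam_width: int = 4,
                        branch: int = 4, max_blocks: int = 64) -> str:
    """Block-wise VGS sketch: expand each beam with several candidate
    blocks, rescore with the value model, keep the top `beam_width`
    partial traces, and aggregate finished traces with a value-weighted
    majority vote."""
    beams = [prompt] * beam_width
    finished = []  # (completed trace, value score)

    for _ in range(max_blocks):
        candidates = []
        for trace in beams:
            for _ in range(branch):
                extended = trace + generate_block(trace)
                candidates.append((extended, value_score(extended)))
        # Set completed traces aside; keep searching over open ones.
        open_traces = []
        for trace, v in candidates:
            (finished if is_finished(trace) else open_traces).append((trace, v))
        if not open_traces:
            break
        open_traces.sort(key=lambda tv: tv[1], reverse=True)
        beams = [trace for trace, _ in open_traces[:beam_width]]

    # Final weighted majority vote: each answer's weight is the sum of
    # the value scores of the traces that produced it.
    votes = defaultdict(float)
    for trace, v in finished:
        votes[extract_answer(trace)] += v
    return max(votes, key=votes.get)
```

Because candidates are scored at block boundaries rather than at annotated "steps," this kind of search needs no step segmentation of the trace, which is the property the paper contrasts against process reward models.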