Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

June 1, 2026
Authors: Atoosa Chegini, Soheil Feizi
cs.AI

Abstract

Selecting the best response from multiple small-model samples with a stronger scorer is a simple inference-time strategy, but it fails when the small model has already committed to an incorrect reasoning path. PRM-guided search avoids this by scoring candidate continuations during generation, but it requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, and the larger model scores the candidates by likelihood without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable because of a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24, with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 percentage points (pp) and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without any reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4-6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM-guided search.
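The two selection rules and the commit-then-continue loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not give the exact CGS formula, so the sketch assumes the score is the difference between the length-normalized log-probabilities of the two models. The helpers `sample_chunks`, `score_large`, and `is_done` are hypothetical callables standing in for the small model's sampler, the large model's likelihood scorer, and a stopping check.

```python
from typing import Callable, List, Sequence, Tuple

# A candidate is (chunk_text, per-token log-probs under the small model).
Candidate = Tuple[str, List[float]]
Rule = Callable[[Sequence[float], Sequence[float]], float]


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one chunk."""
    return sum(token_logprobs) / len(token_logprobs)


def lgs_score(large_lp: Sequence[float], small_lp: Sequence[float]) -> float:
    """Likelihood-Guided Selection: large-model score only."""
    return mean_logprob(large_lp)


def cgs_score(large_lp: Sequence[float], small_lp: Sequence[float]) -> float:
    """Contrastive-Guided Selection.

    ASSUMPTION: plain difference of normalized log-probs; the paper may
    weight or normalize the two terms differently.
    """
    return mean_logprob(large_lp) - mean_logprob(small_lp)


def select_chunk(
    large_lps: Sequence[Sequence[float]],
    small_lps: Sequence[Sequence[float]],
    rule: Rule,
) -> int:
    """Return the index of the highest-scoring candidate chunk."""
    scores = [rule(l, s) for l, s in zip(large_lps, small_lps)]
    return max(range(len(scores)), key=scores.__getitem__)


def guided_generate(
    prompt: str,
    sample_chunks: Callable[[str, int], List[Candidate]],  # hypothetical
    score_large: Callable[[str, str], List[float]],        # hypothetical
    rule: Rule,
    k: int,
    max_steps: int,
    is_done: Callable[[str], bool],                        # hypothetical
) -> str:
    """Commit one selected chunk per step until done or out of steps."""
    text = prompt
    for _ in range(max_steps):
        # Small model proposes k fixed-length chunks.
        candidates = sample_chunks(text, k)
        # Large model only scores the proposals; it generates nothing.
        large_lps = [score_large(text, chunk) for chunk, _ in candidates]
        small_lps = [lp for _, lp in candidates]
        best = select_chunk(large_lps, small_lps, rule)
        # Commit the winner before sampling the next step.
        text += candidates[best][0]
        if is_done(text):
            break
    return text


if __name__ == "__main__":
    # Toy log-probs (made up) where the two rules pick different chunks.
    large = [[-1.0, -1.0], [-0.5, -0.7], [-2.0, -0.2]]
    small = [[-3.0, -3.0], [-0.2, -0.2], [-1.0, -1.0]]
    print("LGS picks", select_chunk(large, small, lgs_score))
    print("CGS picks", select_chunk(large, small, cgs_score))
```

In the toy data, LGS prefers the chunk the large model finds most likely, while CGS prefers the chunk the large model rates well relative to how unlikely the small model found it. Because every candidate has the same number of tokens, length normalization only rescales the scores and does not change their ranking, which fits the abstract's point that fixed-length chunks avoid the length confound.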