Off-the-Shelf Large Language Models as Process Scorers: A Training-Free Alternative to Process Reward Models for Mathematical Reasoning

Abstract

Selecting the best of several small-model samples with a stronger scorer is a simple inference-time strategy, but it fails when the small model has already committed to an incorrect reasoning path. Process reward model (PRM) guided search avoids this by scoring candidate continuations during generation, but it requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, the small model samples k fixed-length candidate chunks, and the large model scores them by likelihood without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable because of a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24, with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 percentage points and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without any reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4 to 6 percentage points. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM-guided search.
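The sample, score, and commit loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact scoring formulas, so the sketch assumes LGS scores a chunk by its mean per-token large-model log-probability and CGS scores it by that mean minus the small model's mean. All function names, and the three callables passed to `guided_generate`, are hypothetical stand-ins for real model calls, and the stopping rule (a fixed number of steps) is also an assumption.

```python
from typing import Callable, List, Sequence

Chunk = List[int]  # a fixed-length run of token ids
# Hypothetical scorer signature: per-token log-probs of `chunk` given `prefix`.
LogProbFn = Callable[[List[int], Chunk], Sequence[float]]
# Hypothetical sampler signature: k candidate chunks continuing `prefix`.
SampleFn = Callable[[List[int], int], List[Chunk]]


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one chunk."""
    if not token_logprobs:
        raise ValueError("a chunk must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score; ties go to the earliest candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])


def lgs_select(large: Sequence[Sequence[float]]) -> int:
    """LGS (assumed form): highest mean large-model log-probability."""
    return argmax([mean_logprob(lp) for lp in large])


def cgs_select(large: Sequence[Sequence[float]],
               small: Sequence[Sequence[float]]) -> int:
    """CGS (assumed form): large-model mean minus small-model mean."""
    if len(large) != len(small):
        raise ValueError("need one small-model score list per candidate")
    return argmax([mean_logprob(l) - mean_logprob(s)
                   for l, s in zip(large, small)])


def guided_generate(prefix: List[int], sample_chunks: SampleFn,
                    large_logprobs: LogProbFn, small_logprobs: LogProbFn,
                    k: int, max_steps: int, rule: str = "cgs") -> List[int]:
    """Sample k chunks, score them, commit the winner, and repeat."""
    if rule not in ("lgs", "cgs"):
        raise ValueError("rule must be 'lgs' or 'cgs'")
    tokens = list(prefix)
    for _ in range(max_steps):
        candidates = sample_chunks(tokens, k)
        if not candidates:
            break
        large = [large_logprobs(tokens, c) for c in candidates]
        if rule == "lgs":
            best = lgs_select(large)
        else:
            small = [small_logprobs(tokens, c) for c in candidates]
            best = cgs_select(large, small)
        # Commit the chosen chunk before sampling the next step.
        tokens.extend(candidates[best])
    return tokens
```

Because every candidate has the same length, dividing by the chunk length rescales all scores by the same constant and leaves the argmax unchanged; the normalization is kept only to mirror the description of LGS. In practice the three callables would wrap a sampling call to the small model and teacher-forced forward passes of both models, which is why the large model never has to generate text.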