AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
February 19, 2025
Authors: Yuliang Liu, Junjie Lu, Zhaoling Chen, Chaofeng Qu, Jason Klein Liu, Chonghan Liu, Zefan Cai, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin
cs.AI
Abstract
Current approaches for training Process Reward Models (PRMs) often involve breaking responses down into multiple reasoning steps using rule-based techniques, such as inserting predefined placeholder tokens or setting the reasoning step length to a fixed size. These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division provides more decision-making information at each step, enhancing downstream tasks such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs on mathematical reasoning and code generation tasks. Experimental results indicate that the resulting PRM achieves state-of-the-art Best-of-N performance, surpassing the greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study of the PRM's performance, transferability, and generalization capabilities.
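As a rough illustration of the confidence-based step division described in the abstract, the minimal sketch below scores each token of a generated response by the probability the model assigns to it and starts a new reasoning step whenever that confidence drops below a threshold. This is not the paper's reference implementation: the model name `gpt2`, the threshold `CONF_THRESHOLD`, the use of the taken-token probability as the confidence signal, and the helper `split_by_confidence` are all illustrative assumptions.

```python
# Minimal sketch of confidence-based reasoning-step division (illustrative only;
# model, threshold, and confidence definition are assumptions, not the paper's).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # hypothetical stand-in for the paper's generator
CONF_THRESHOLD = 0.5       # hypothetical threshold, not taken from the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def split_by_confidence(prompt: str, response: str) -> list[str]:
    """Split `response` into steps at tokens where the model's confidence is low."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(response, add_special_tokens=False,
                             return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits            # (1, seq_len, vocab)

    # Next-token distribution at every position except the last one.
    probs = torch.softmax(logits[0, :-1], dim=-1)
    start = prompt_ids.shape[1]                     # first response-token position
    taken = input_ids[0, start:]                    # the response tokens themselves
    # Confidence proxy: probability the model assigns to the token actually taken.
    conf = probs[start - 1:].gather(1, taken.unsqueeze(1)).squeeze(1)

    steps, current = [], []
    for tok_id, c in zip(taken.tolist(), conf.tolist()):
        current.append(tok_id)
        if c < CONF_THRESHOLD:                      # low confidence => candidate decision point
            steps.append(tokenizer.decode(current))
            current = []
    if current:
        steps.append(tokenizer.decode(current))
    return steps
```

The step boundaries produced this way are where a PRM would score partial responses, which is the downstream use (reward model learning and Best-of-N selection) described in the abstract.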