AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence

Abstract

Current approaches to training Process Reward Models (PRMs) often break responses into multiple reasoning steps using rule-based techniques, such as splitting at predefined placeholder tokens or fixing the length of each reasoning step. These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division provides more decision-making information at each step, which benefits downstream tasks such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs on mathematical reasoning and code generation tasks. Experimental results indicate that the resulting PRM achieves state-of-the-art Best-of-N performance, surpassing a greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study of the PRM's performance, transferability, and generalization capabilities.
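The idea of dividing steps by model confidence can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_by_confidence`, the fixed `threshold` parameter, and the choice to start a new step just before each low-confidence token are all assumptions made for illustration. The sketch also assumes that per-token confidences (the probability the model assigned to each token it produced) have already been computed.

```python
from typing import List, Sequence


def split_by_confidence(
    tokens: Sequence[str],
    confidences: Sequence[float],
    threshold: float,
) -> List[List[str]]:
    """Split tokens into steps, opening a new step at each low-confidence token.

    confidences[i] is assumed to be the probability the model assigned to
    tokens[i] when predicting it. A value below `threshold` is treated as a
    decision point (hypothetical rule, chosen for illustration).
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")

    steps: List[List[str]] = []
    current: List[str] = []
    for token, conf in zip(tokens, confidences):
        # A low-confidence prediction marks a decision point: close the
        # current step (if it is non-empty) and begin a new one here.
        if conf < threshold and current:
            steps.append(current)
            current = []
        current.append(token)
    if current:
        steps.append(current)
    return steps


if __name__ == "__main__":
    example_tokens = ["x", "=", "2", "so", "y", "=", "4"]
    example_confidences = [0.9, 0.95, 0.4, 0.3, 0.9, 0.99, 0.5]
    for step in split_by_confidence(example_tokens, example_confidences, 0.6):
        print(" ".join(step))
```

Unlike a rule that splits at a fixed token or a fixed length, the boundaries here depend only on where the model was uncertain, so no manual annotation of step boundaries is needed.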
