A Contextual Quality Reward Model for Reliable and Efficient Best-of-N Sampling

Abstract

Modern preference alignment techniques, such as Best-of-N (BoN) sampling, rely on reward models trained on pairwise comparison data. While effective at learning relative preferences, this paradigm does not capture a signal of response acceptability, leaving systems vulnerable to selecting the least bad of many unacceptable options. This is particularly problematic for hard prompts, where the risk of such false acceptances increases with the number of samples. In this paper, we address this critical reliability gap by introducing a new data collection and modeling framework. By augmenting preference data with an outside option, inspired by discrete choice models, we train a reward model that can distinguish not just what is better, but what is good enough. We leverage this capability to create an adaptive inference strategy, best of mini-N in-loop, which partitions the generation budget into sequential loops with a calibrated early-exit condition. Our experiments show that, when tuned as an alignment guardrail, it reduces reliability failures by 70%, and when tuned as an inference accelerator, it improves average inference speed by over 22% in the IMDB sentiment setting. We thus provide a principled and flexible framework for practitioners to explicitly manage the trade-off between reliability and computational efficiency.
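The in-loop strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the `generate` and `reward` callables, and the parameters `mini_n` and `threshold` are hypothetical. The sketch assumes the reward is normalized so that the outside option scores 0, that a response counts as "good enough" once its reward reaches a calibrated `threshold`, and that the best response seen so far is returned, flagged as not accepted, if the budget runs out.

```python
from typing import Callable, List, Optional, Tuple


def best_of_mini_n_in_loop(
    prompt: str,
    generate: Callable[[str, int], List[str]],  # hypothetical sampler: (prompt, n) -> n responses
    reward: Callable[[str, str], float],        # hypothetical reward model: (prompt, response) -> score
    total_budget: int,
    mini_n: int,
    threshold: float = 0.0,  # assumption: the outside option is normalized to reward 0
) -> Tuple[Optional[str], float, int, bool]:
    """Return (best_response, best_score, samples_used, accepted)."""
    if total_budget <= 0 or mini_n <= 0:
        raise ValueError("total_budget and mini_n must be positive")

    best_response: Optional[str] = None
    best_score = float("-inf")
    used = 0

    while used < total_budget:
        # Draw one mini-batch, never exceeding the remaining budget.
        batch_size = min(mini_n, total_budget - used)
        for candidate in generate(prompt, batch_size):
            score = reward(prompt, candidate)
            if score > best_score:
                best_response, best_score = candidate, score
        used += batch_size

        # Early exit: stop as soon as the best response so far is acceptable.
        if best_score >= threshold:
            return best_response, best_score, used, True

    # Budget exhausted without an acceptable response (assumed fallback behavior).
    return best_response, best_score, used, False
```

In this sketch, a higher `threshold` behaves more like a guardrail (fewer false acceptances, more samples drawn), while a smaller `mini_n` allows earlier exits and so behaves more like an accelerator. How the threshold is actually calibrated is not specified in the abstract.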
