

A Contextual Quality Reward Model for Reliable and Efficient Best-of-N Sampling

October 5, 2025
Author: Hyung Gyu Rho
cs.AI

Abstract

Modern preference alignment techniques, such as Best-of-N (BoN) sampling, rely on reward models trained with pairwise comparison data. While effective at learning relative preferences, this paradigm fails to capture a signal of response acceptability, leaving systems vulnerable to selecting the least bad of many unacceptable options. This is particularly problematic for hard prompts, where the risk of such false acceptances increases with the number of samples. In this paper, we address this critical reliability gap by introducing a new data collection and modeling framework. By augmenting preference data with an outside option, inspired by discrete choice models, we train a reward model that can distinguish not just what is better, but what is good enough. We leverage this capability to create an adaptive inference strategy, best-of-mini-N in-loop, which partitions the generation budget into sequential loops with a calibrated early-exit condition. Our experiments show that when tuned as an alignment guardrail, it reduces reliability failures by 70%, and when tuned as an inference accelerator, it improves average inference speed by over 22% in the IMDB sentiment setting. We thus provide a principled and flexible framework for practitioners to explicitly manage the trade-off between reliability and computational efficiency.
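
The abstract does not spell out the training objective, but the "outside option" idea from discrete choice theory suggests one natural formulation: normalize the outside ("neither response is acceptable") alternative to zero utility and model the annotator's pick as a softmax over the two responses and that outside option. The sketch below is a minimal illustration of that reading; the function name, label encoding, and zero-utility convention are assumptions for exposition, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def outside_option_loss(reward_a: torch.Tensor,
                        reward_b: torch.Tensor,
                        choice: torch.Tensor) -> torch.Tensor:
    """Multinomial-logit loss over {response A, response B, outside option}.

    reward_a, reward_b: (batch,) scalar rewards from the reward model.
    choice: (batch,) integer labels: 0 = A preferred, 1 = B preferred,
            2 = neither is acceptable (the outside option).
    Assumption: the outside option's utility is fixed at zero, so a response
    counts as "good enough" when its reward exceeds 0.
    """
    zeros = torch.zeros_like(reward_a)
    # Stack the three utilities into per-example logits of shape (batch, 3).
    logits = torch.stack([reward_a, reward_b, zeros], dim=-1)
    return F.cross_entropy(logits, choice)
```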
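The best-of-mini-N in-loop strategy can likewise be sketched from the description alone: spend the generation budget in small sequential batches, score each batch with the reward model, and stop as soon as a candidate clears a calibrated acceptance threshold. The `generate` and `reward_model` callables, the default budget and loop size, and the return convention below are hypothetical placeholders, not the paper's interface.

```python
def best_of_mini_n_in_loop(prompt, generate, reward_model,
                           total_budget=32, mini_n=4, threshold=0.0):
    """Adaptive BoN: sequential mini-batches with a calibrated early exit.

    generate(prompt, k) -> list of k candidate responses (assumed interface).
    reward_model(prompt, response) -> scalar reward (assumed interface).
    threshold: calibrated acceptance level; with the outside option's utility
               normalized to zero, 0.0 is one natural default.
    Returns (best_response, best_reward, accepted).
    """
    best_response, best_reward = None, float("-inf")
    for _ in range(total_budget // mini_n):
        for response in generate(prompt, mini_n):
            r = reward_model(prompt, response)
            if r > best_reward:
                best_response, best_reward = response, r
        if best_reward >= threshold:
            # Early exit: a candidate is already "good enough".
            return best_response, best_reward, True
    # Budget exhausted without clearing the threshold; when used as a
    # guardrail, the caller can treat accepted=False as a reliability flag.
    return best_response, best_reward, False
```

Raising the threshold pushes the procedure toward the guardrail regime (fewer false acceptances, more loops), while lowering it favors the accelerator regime (earlier exits, less compute), which is the reliability/efficiency trade-off the abstract describes.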