A Contextual Quality Reward Model for Reliable and Efficient Best-of-N Sampling
October 5, 2025
Author: Hyung Gyu Rho
cs.AI
Abstract
Modern preference alignment techniques, such as Best-of-N (BoN) sampling,
rely on reward models trained with pairwise comparison data. While effective at
learning relative preferences, this paradigm fails to capture a signal of
response acceptability, leaving systems vulnerable to selecting the least bad
of many unacceptable options. This is particularly problematic for hard
prompts, where the risk of such false acceptances increases with the number of
samples. In this paper, we address this critical reliability gap by introducing
a new data collection and modeling framework. By augmenting preference data
with an outside option, inspired by discrete choice models, we train a reward
model that can distinguish not just what is better, but what is
good enough. We leverage this capability to create an adaptive
inference strategy, best-of-mini-N in-loop, which partitions the generation
budget into sequential loops with a calibrated early-exit condition. Our
experiments show that when tuned as an alignment guardrail, it reduces
reliability failures by 70%, and when tuned as an inference accelerator, it
improves average inference speed by over 22% in the IMDB sentiment setting. We
thus provide a principled and flexible framework for practitioners to
explicitly manage the trade-off between reliability and computational
efficiency.
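
To make the adaptive inference strategy concrete, the following is a minimal Python sketch of a best-of-mini-N in-loop sampler. It assumes hypothetical `generate(prompt)` and `reward_model(prompt, response)` callables, and a calibrated acceptability threshold `tau` standing in for the early-exit condition learned via the outside option; it illustrates the control flow only and is not the authors' implementation.

```python
# Illustrative sketch of a "best-of-mini-N in-loop" sampler. All names here
# (generate, reward_model, tau, mini_n, max_loops) are hypothetical stand-ins
# for the interfaces described in the abstract.

def best_of_mini_n_in_loop(prompt, generate, reward_model,
                           tau, mini_n=4, max_loops=4):
    """Spend the generation budget in sequential loops of `mini_n` samples,
    exiting early once a candidate is judged acceptable (score >= tau)."""
    best_response, best_score = None, float("-inf")
    for _ in range(max_loops):
        # One mini-loop: draw and score a small batch of candidates.
        for _ in range(mini_n):
            response = generate(prompt)
            score = reward_model(prompt, response)
            if score > best_score:
                best_response, best_score = response, score
        # Calibrated early exit: stop as soon as the best candidate so far
        # clears the acceptability threshold derived from the outside option.
        if best_score >= tau:
            return best_response, True  # accepted as good enough
    # Budget exhausted without an acceptable candidate: surface the failure
    # instead of silently returning the "least bad" option.
    return best_response, False  # flagged as a reliability failure
```

In this sketch, the returned flag lets a caller distinguish an accepted response from a best-effort fallback, which is where the trade-off between reliability (stricter `tau`, more loops) and computational efficiency (earlier exits) would be managed.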