A Contextual Quality Reward Model for Reliable and Efficient Best-of-N Sampling

Abstract

Modern preference alignment techniques, such as Best-of-N (BoN) sampling, rely on reward models trained on pairwise comparison data. While effective at learning relative preferences, this paradigm captures no signal of response acceptability, leaving systems vulnerable to selecting the least bad of many unacceptable options. This is particularly problematic for hard prompts, where the risk of such false acceptances increases with the number of samples. In this paper, we address this reliability gap by introducing a new data collection and modeling framework. By augmenting preference data with an outside option, inspired by discrete choice models, we train a reward model that can distinguish not just what is better, but what is good enough. We leverage this capability to create an adaptive inference strategy, best of mini-N in-loop, which partitions the generation budget into sequential loops with a calibrated early-exit condition. Our experiments show that, when tuned as an alignment guardrail, it reduces reliability failures by 70%, and when tuned as an inference accelerator, it improves average inference speed by over 22% in the IMDB-sentiment setting. We thus provide a principled and flexible framework for practitioners to explicitly manage the trade-off between reliability and computational efficiency.
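The in-loop strategy summarized above can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the authors' implementation. The names `generate`, `reward`, `mini_n`, and `accept_threshold` are hypothetical. The sketch assumes the reward is normalized so that the outside option scores 0. It also assumes the procedure abstains (returns `None`) when the budget runs out without an acceptable response, which the abstract does not specify.

```python
from typing import Callable, Optional, Tuple


def best_of_mini_n_in_loop(
    prompt: str,
    generate: Callable[[str], str],        # hypothetical sampler: prompt -> response
    reward: Callable[[str, str], float],   # hypothetical reward: (prompt, response) -> score
    total_budget: int = 16,
    mini_n: int = 4,
    accept_threshold: float = 0.0,         # assumed: outside option normalized to score 0
) -> Tuple[Optional[str], int]:
    """Return (accepted response or None, number of samples drawn)."""
    if total_budget <= 0 or mini_n <= 0:
        raise ValueError("total_budget and mini_n must be positive")

    used = 0
    while used < total_budget:
        # Draw one mini-batch, never exceeding the remaining budget.
        batch_size = min(mini_n, total_budget - used)
        candidates = [generate(prompt) for _ in range(batch_size)]
        used += batch_size

        # Pick the best candidate of this loop.
        scores = [reward(prompt, c) for c in candidates]
        best_idx = max(range(batch_size), key=scores.__getitem__)

        # Early exit: stop as soon as the best candidate beats the outside option.
        if scores[best_idx] > accept_threshold:
            return candidates[best_idx], used

    # Assumption: abstain when no candidate was judged good enough.
    return None, used
```

Under these assumptions, raising `accept_threshold` makes the procedure behave more like a guardrail (fewer false acceptances, more samples drawn), while lowering it favors early exits and therefore faster average inference.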
