Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

Abstract

Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly when finetuning is impractical or impossible because of computational constraints or private model weights. However, existing test-time search methods that rely on a reward model (RM) often degrade in quality as compute scales, because they over-optimize inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As test-time compute scales, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method produces better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods such as best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). Our approach offers a practical way to align language models at test time using additional computation without degradation, and it expands the limits of the capability that can be obtained from off-the-shelf language models without further training.
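To make the idea of "sampling rather than searching" concrete, the following is a minimal toy sketch and not the QAlign algorithm itself. It assumes that the "optimal aligned distribution" takes the standard KL-regularized form, proportional to p(y) * exp(r(y) / beta); the abstract does not state this formula. It also assumes a simplified Metropolis-Hastings chain whose proposals are drawn independently from the base model, which may differ from the proposal mechanism used in the paper. Under these assumptions the base-model probabilities cancel in the acceptance ratio, so only reward values are needed and no logits are accessed. All names (`BASE`, `sample_base`, `reward`, `mh_align`, `target_prob`, `beta`) are hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical toy "base model": a fixed distribution over three candidate answers.
BASE = {"41": 0.5, "42": 0.3, "43": 0.2}


def sample_base(rng):
    """Draw one answer from the toy base model (stands in for an LM call)."""
    answers = list(BASE.keys())
    weights = list(BASE.values())
    return rng.choices(answers, weights=weights, k=1)[0]


def reward(y):
    """Hypothetical reward model: prefers the answer "42"."""
    return 1.0 if y == "42" else 0.0


def mh_align(sample_base, reward, beta, steps, rng):
    """Independence Metropolis-Hastings targeting p(y) * exp(reward(y) / beta).

    Because proposals come from the base model itself, p(y) cancels in the
    acceptance ratio, leaving only the reward difference scaled by beta.
    """
    current = sample_base(rng)
    chain = [current]
    for _ in range(steps):
        proposal = sample_base(rng)
        log_accept = (reward(proposal) - reward(current)) / beta
        # Accept with probability min(1, exp(log_accept)).
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            current = proposal
        chain.append(current)
    return chain


def target_prob(y, beta):
    """Exact probability of y under the assumed aligned distribution (toy case only)."""
    weights = {k: p * math.exp(reward(k) / beta) for k, p in BASE.items()}
    return weights[y] / sum(weights.values())


if __name__ == "__main__":
    rng = random.Random(0)
    chain = mh_align(sample_base, reward, beta=0.5, steps=20000, rng=rng)
    print("empirical frequency of '42':", chain.count("42") / len(chain))
    print("exact target probability   :", target_prob("42", 0.5))
    print("majority vote over chain   :", Counter(chain).most_common(1)[0][0])
```

In this toy setting the base model assigns "42" a probability of only 0.3, while the assumed aligned distribution with beta = 0.5 assigns it about 0.76, and the chain's empirical frequency approaches that value as the number of steps grows. Unlike best-of-n, which keeps only the single highest-reward sample, the chain keeps sampling in proportion to both the base probability and the reward, which is one intuition for why over-optimizing an imperfect reward can be avoided.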
