Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Abstract

Generative Reward Models (GenRMs) and LLM-as-a-Judge systems are trained and evaluated to prioritize outcome accuracy, and as a result they exhibit deceptive alignment: they produce correct judgments for incorrect reasons. This undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between a model's reasoning process and human judgment. Our evaluation of frontier models shows that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, whereas outcome accuracy falls short in both respects. To close this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. When the resulting reward model is used during RLHF, it improves performance on Arena Hard v2, notably yielding a 7% improvement on creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, reversing the decline in rationale consistency observed under outcome-only training.
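
As a rough illustration of what combining the two signals could look like, the sketch below scores a single pairwise judgment. The abstract does not say how rationale consistency is computed or how the signals are combined, so everything here is an assumption made for illustration: the `Judgment` fields, the "fraction of human-annotated reasons matched" definition, the weighted sum, and the `weight` parameter are all hypothetical and should not be read as the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One pairwise judgment from a reward model (hypothetical schema)."""
    predicted_winner: str      # e.g. "A" or "B"
    matched_reasons: int       # human-annotated reasons also found in the model's rationale
    total_human_reasons: int   # number of human-annotated reasons for this example


def outcome_reward(j: Judgment, gold_winner: str) -> float:
    """1.0 if the model picked the human-preferred response, else 0.0."""
    return 1.0 if j.predicted_winner == gold_winner else 0.0


def rationale_consistency(j: Judgment) -> float:
    """Assumed definition: share of human reasons the model's rationale covers."""
    if j.total_human_reasons == 0:
        return 0.0
    return j.matched_reasons / j.total_human_reasons


def hybrid_reward(j: Judgment, gold_winner: str, weight: float = 0.5) -> float:
    """Assumed combination: weighted sum of outcome and rationale signals."""
    return (1.0 - weight) * outcome_reward(j, gold_winner) + weight * rationale_consistency(j)


# A judgment that is right for the wrong reasons gets full outcome credit
# but only partial hybrid credit.
right_wrong_reasons = Judgment(predicted_winner="A", matched_reasons=0, total_human_reasons=4)
right_good_reasons = Judgment(predicted_winner="A", matched_reasons=3, total_human_reasons=4)
print(outcome_reward(right_wrong_reasons, "A"), hybrid_reward(right_wrong_reasons, "A"))
print(outcome_reward(right_good_reasons, "A"), hybrid_reward(right_good_reasons, "A"))
```

Under these assumptions, an outcome-only signal cannot tell the two judgments apart, while the hybrid signal ranks the one whose rationale agrees with the human reasons higher.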
