Robust Reward Modeling via Causal Rubrics

Abstract

Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. They tend to latch onto superficial or spurious attributes, such as response length or formatting, mistaking these cues, learned from correlations in the training data, for the true causal drivers of quality (e.g., factuality, relevance). This occurs because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. We introduce Crome (Causally Robust Reward Modeling), a novel framework grounded in an explicit causal model and designed to mitigate reward hacking. Crome employs two kinds of synthetic, targeted augmentations during training: (1) Causal Augmentations, pairs that differ along a specific causal attribute, which enforce sensitivity to each causal attribute individually, and (2) Neutral Augmentations, tie-label pairs that vary primarily in spurious attributes, which enforce invariance to spurious attributes. Notably, the augmentations are produced without any knowledge of the spurious factors: answers are intervened on only along causal rubrics, which are identified by querying an oracle LLM. Empirically, Crome significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.4% and achieving gains of up to 13.2% and 7.2% in specific categories. Crome's robustness is further demonstrated by consistent gains in a Best-of-N inference setting as N increases, across several benchmarks: RewardBench (covering chat, chat-hard, safety, and reasoning tasks), the safety-focused WildGuardTest, and the reasoning-specific GSM8k.
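To make the two augmentation types concrete, the following is a minimal sketch of how preferred pairs and tie-label pairs could enter one pairwise objective. The abstract does not specify the loss, so the Bradley-Terry form, the soft target of 0.5 for ties, and the names `Pair`, `softplus`, and `pair_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Pair:
    """One training pair (hypothetical structure, not from the paper)."""
    reward_a: float  # reward model score for response A
    reward_b: float  # reward model score for response B
    target: float    # 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pair_loss(pair: Pair) -> float:
    """Cross-entropy between the target and sigmoid(reward_a - reward_b).

    Assumed Bradley-Terry-style objective: a causal pair (target 1.0 or 0.0)
    rewards a large margin in the right direction, while a neutral pair
    (target 0.5) is minimized when both responses get the same reward.
    """
    margin = pair.reward_a - pair.reward_b
    log_p_a = -softplus(-margin)  # log sigmoid(margin)
    log_p_b = -softplus(margin)   # log (1 - sigmoid(margin))
    return -(pair.target * log_p_a + (1.0 - pair.target) * log_p_b)


if __name__ == "__main__":
    causal = Pair(reward_a=2.0, reward_b=0.0, target=1.0)
    neutral_equal = Pair(reward_a=1.0, reward_b=1.0, target=0.5)
    neutral_gap = Pair(reward_a=3.0, reward_b=1.0, target=0.5)
    for name, p in [("causal", causal),
                    ("neutral, equal rewards", neutral_equal),
                    ("neutral, reward gap", neutral_gap)]:
        print(f"{name}: loss = {pair_loss(p):.4f}")
```

Under this assumed objective, a tie-label pair is penalized whenever the two rewards differ, which is one way to express invariance to spurious attributes, while a causal pair is penalized unless the better response scores higher, which expresses sensitivity to the causal attribute.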
