

Robust Reward Modeling via Causal Rubrics

June 19, 2025
作者: Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, Doina Precup
cs.AI

Abstract

Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. They tend to latch on to superficial or spurious attributes, such as response length or formatting, mistaking these cues learned from correlations in training data for the true causal drivers of quality (e.g., factuality, relevance). This occurs because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. We introduce Crome (Causally Robust Reward Modeling), a novel framework grounded in an explicit causal model designed to mitigate reward hacking. Crome employs the following synthetic targeted augmentations during training: (1) Causal Augmentations, which are pairs that differ along specific causal attributes, to enforce sensitivity along each causal attribute individually, and (2) Neutral Augmentations, which are tie-label pairs varying primarily in spurious attributes, to enforce invariance along spurious attributes. Notably, our augmentations are produced without any knowledge of spurious factors, via answer interventions along causal rubrics alone, which are identified by querying an oracle LLM. Empirically, Crome significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.4% and achieving gains of up to 13.2% and 7.2% in specific categories. The robustness of Crome is further demonstrated by the consistent gains obtained in a Best-of-N inference setting as N increases, across various benchmarks, including the popular RewardBench (covering chat, chat-hard, safety, and reasoning tasks), the safety-focused WildGuardTest, and the reasoning-specific GSM8k.
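As a rough illustration of the two augmentation types described in the abstract, the Python sketch below shows how such training pairs might be constructed. The names `query_oracle_llm`, `PreferencePair`, and the rubric prompts are hypothetical placeholders chosen for clarity; they are not the authors' implementation, only a minimal reading of the described scheme.

```python
# Minimal sketch of Crome-style targeted augmentations for reward-model training.
# `query_oracle_llm` and the rubric wording are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str        # preferred response
    rejected: str      # dispreferred response
    tie: bool = False  # tie-label pairs carry no preference signal


def query_oracle_llm(instruction: str, text: str) -> str:
    """Placeholder for a call to an oracle LLM (e.g., via an external API)."""
    raise NotImplementedError


def causal_augmentation(pair: PreferencePair, rubric: str) -> PreferencePair:
    """Degrade the chosen answer along a single causal rubric (e.g., factuality),
    yielding a pair that differs only in that causal attribute."""
    degraded = query_oracle_llm(
        f"Rewrite the answer so it is worse only in terms of {rubric}, "
        "keeping everything else unchanged.",
        pair.chosen,
    )
    return PreferencePair(pair.prompt, chosen=pair.chosen, rejected=degraded)


def neutral_augmentation(pair: PreferencePair) -> PreferencePair:
    """Rewrite the chosen answer so it varies mainly in spurious attributes
    (length, formatting) while preserving content; labeled as a tie."""
    rewritten = query_oracle_llm(
        "Rewrite the answer with different length and formatting "
        "but identical content.",
        pair.chosen,
    )
    return PreferencePair(pair.prompt, chosen=pair.chosen,
                          rejected=rewritten, tie=True)
```

In this reading, the augmented pairs are simply mixed into the standard pairwise RM training set: causal pairs push the RM to be sensitive to each causal rubric, while tie-labeled neutral pairs push it to be invariant to spurious variation.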