
Robust Reward Modeling via Causal Rubrics

June 19, 2025
作者: Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, Doina Precup
cs.AI

Abstract

Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. They tend to latch onto superficial or spurious attributes, such as response length or formatting, mistaking these cues, learned from correlations in the training data, for the true causal drivers of quality (e.g., factuality, relevance). This occurs because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. We introduce Crome (Causally Robust Reward Modeling), a novel framework grounded in an explicit causal model and designed to mitigate reward hacking. Crome employs two kinds of targeted synthetic augmentations during training: (1) Causal Augmentations, pairs that differ along a specific causal attribute, to enforce sensitivity to each causal attribute individually, and (2) Neutral Augmentations, tie-label pairs that vary primarily in spurious attributes, to enforce invariance to those attributes. Notably, our augmentations are produced without any knowledge of spurious factors, via answer interventions only along causal rubrics identified by querying an oracle LLM. Empirically, Crome significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.4% and achieving gains of up to 13.2% and 7.2% in specific categories. Crome's robustness is further evidenced by consistent gains in a Best-of-N inference setting as N increases, across benchmarks including the popular RewardBench (covering chat, chat-hard, safety, and reasoning tasks), the safety-focused WildGuardTest, and the reasoning-specific GSM8k.
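
To make the two augmentation types concrete, here is a minimal sketch (PyTorch) of how such pairs might be generated and scored. The prompts, the `oracle_llm` callable, and the squared-difference tie penalty are illustrative assumptions under a standard pairwise reward-modeling setup, not the authors' implementation.

```python
# Illustrative sketch only; prompts, oracle_llm, and the tie penalty are assumptions.
import torch
import torch.nn.functional as F

def causal_augmentation(query, answer, rubric, oracle_llm):
    """Degrade `answer` along a single causal rubric (e.g., factuality) to build a
    (chosen, rejected) pair that differs only on that attribute.
    `oracle_llm` is a hypothetical text-in/text-out callable."""
    degraded = oracle_llm(
        f"Rewrite the answer so it is worse only in {rubric}, keeping all other "
        f"aspects unchanged.\nQuery: {query}\nAnswer: {answer}"
    )
    return {"query": query, "chosen": answer, "rejected": degraded, "tie": False}

def neutral_augmentation(query, answer, oracle_llm):
    """Vary only spurious surface attributes (length, formatting) to build a
    tie-labelled pair the reward model should score equally."""
    restyled = oracle_llm(
        f"Rewrite the answer changing only its style, length, and formatting, "
        f"without changing its content or quality.\nQuery: {query}\nAnswer: {answer}"
    )
    return {"query": query, "chosen": answer, "rejected": restyled, "tie": True}

def crome_style_loss(r_chosen, r_rejected, is_tie, tie_weight=1.0):
    """Bradley-Terry preference loss on causal pairs plus a squared-difference
    penalty that pulls the rewards of tie-labelled (neutral) pairs together."""
    pref_loss = -(1.0 - is_tie) * F.logsigmoid(r_chosen - r_rejected)
    tie_loss = is_tie * (r_chosen - r_rejected) ** 2
    return (pref_loss + tie_weight * tie_loss).mean()

# Toy usage with random reward scores, just to show the expected shapes.
scores_chosen, scores_rejected = torch.randn(8), torch.randn(8)
tie_mask = torch.randint(0, 2, (8,)).float()  # 1.0 marks neutral (tie) pairs
print(crome_style_loss(scores_chosen, scores_rejected, tie_mask))
```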