一符愚弄LLM即法官
One Token to Fool LLM-as-a-Judge
July 11, 2025
作者: Yulai Zhao, Haolin Liu, Dian Yu, S. Y. Kung, Haitao Mi, Dong Yu
cs.AI
摘要
生成式奖励模型(亦称LLMs-as-judges),即利用大型语言模型(LLMs)评估答案质量的方法,在可验证奖励的强化学习(RLVR)中日益受到青睐。相较于刻板的基于规则的指标,它们尤其适用于涉及自由形式输出的复杂推理任务。在此模式下,通常通过提示LLM将候选答案与真实参考进行对比,并分配一个表示正确性的二元奖励。尽管这一对比任务看似简单,我们发现生成式奖励模型对表面操作表现出令人惊讶的脆弱性:非文字符号(如“:”或“.”)或推理引导语如“思考过程:”和“让我们一步步解决这个问题。”往往会导致误判奖励。我们证明,这一弱点普遍存在于不同LLM、数据集及提示格式中,对依赖生成式奖励模型的核心算法范式,如拒绝采样、偏好优化及RLVR,构成了严重威胁。为缓解此问题,我们提出了一种简单而有效的数据增强策略,并训练了一个显著提升鲁棒性的新型生成式奖励模型。我们的研究结果强调了开发更可靠的基于LLM评估方法的迫切需求。我们已在https://huggingface.co/sarosavo/Master-RM和https://huggingface.co/datasets/sarosavo/Master-RM上发布了我们稳健、通用领域的奖励模型及其合成训练数据。
English
Generative reward models (also known as LLMs-as-judges), which use large
language models (LLMs) to evaluate answer quality, are increasingly adopted in
reinforcement learning with verifiable rewards (RLVR). They are often preferred
over rigid rule-based metrics, especially for complex reasoning tasks involving
free-form outputs. In this paradigm, an LLM is typically prompted to compare a
candidate answer against a ground-truth reference and assign a binary reward
indicating correctness. Despite the seeming simplicity of this comparison task,
we find that generative reward models exhibit surprising vulnerabilities to
superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning
openers like "Thought process:" and "Let's solve this problem step by step."
can often lead to false positive rewards. We demonstrate that this weakness is
widespread across LLMs, datasets, and prompt formats, posing a serious threat
for core algorithmic paradigms that rely on generative reward models, such as
rejection sampling, preference optimization, and RLVR. To mitigate this issue,
we introduce a simple yet effective data augmentation strategy and train a new
generative reward model with substantially improved robustness. Our findings
highlight the urgent need for more reliable LLM-based evaluation methods. We
release our robust, general-domain reward model and its synthetic training data
at https://huggingface.co/sarosavo/Master-RM and
https://huggingface.co/datasets/sarosavo/Master-RM.