ChatPaper.aiChatPaper

一符愚弄LLM即法官

One Token to Fool LLM-as-a-Judge

July 11, 2025
作者: Yulai Zhao, Haolin Liu, Dian Yu, S. Y. Kung, Haitao Mi, Dong Yu
cs.AI

摘要

生成式奖励模型(亦称LLMs-as-judges),即利用大型语言模型(LLMs)评估答案质量的方法,在可验证奖励的强化学习(RLVR)中日益受到青睐。相较于刻板的基于规则的指标,它们尤其适用于涉及自由形式输出的复杂推理任务。在此模式下,通常通过提示LLM将候选答案与真实参考进行对比,并分配一个表示正确性的二元奖励。尽管这一对比任务看似简单,我们发现生成式奖励模型对表面操作表现出令人惊讶的脆弱性:非文字符号(如“:”或“.”)或推理引导语如“思考过程:”和“让我们一步步解决这个问题。”往往会导致误判奖励。我们证明,这一弱点普遍存在于不同LLM、数据集及提示格式中,对依赖生成式奖励模型的核心算法范式,如拒绝采样、偏好优化及RLVR,构成了严重威胁。为缓解此问题,我们提出了一种简单而有效的数据增强策略,并训练了一个显著提升鲁棒性的新型生成式奖励模型。我们的研究结果强调了开发更可靠的基于LLM评估方法的迫切需求。我们已在https://huggingface.co/sarosavo/Master-RM和https://huggingface.co/datasets/sarosavo/Master-RM上发布了我们稳健、通用领域的奖励模型及其合成训练数据。
English
Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
PDF253July 14, 2025