一符愚弄LLM即法官

摘要

生成式奖励模型（亦称LLMs-as-judges），即利用大型语言模型（LLMs）评估答案质量的方法，在可验证奖励的强化学习（RLVR）中日益受到青睐。相较于刻板的基于规则的指标，它们尤其适用于涉及自由形式输出的复杂推理任务。在此模式下，通常通过提示LLM将候选答案与真实参考进行对比，并分配一个表示正确性的二元奖励。尽管这一对比任务看似简单，我们发现生成式奖励模型对表面操作表现出令人惊讶的脆弱性：非文字符号（如“:”或“.”）或推理引导语如“思考过程：”和“让我们一步步解决这个问题。”往往会导致误判奖励。我们证明，这一弱点普遍存在于不同LLM、数据集及提示格式中，对依赖生成式奖励模型的核心算法范式，如拒绝采样、偏好优化及RLVR，构成了严重威胁。为缓解此问题，我们提出了一种简单而有效的数据增强策略，并训练了一个显著提升鲁棒性的新型生成式奖励模型。我们的研究结果强调了开发更可靠的基于LLM评估方法的迫切需求。我们已在https://huggingface.co/sarosavo/Master-RM和https://huggingface.co/datasets/sarosavo/Master-RM上发布了我们稳健、通用领域的奖励模型及其合成训练数据。

English

Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

一符愚弄LLM即法官

One Token to Fool LLM-as-a-Judge

摘要

Support