一令愚弄大模型即法官

摘要

生成式獎勵模型（亦稱作LLMs-as-judges），利用大型語言模型（LLMs）來評估回答質量，在可驗證獎勵的強化學習（RLVR）中日益受到採用。相較於僵化的基於規則的指標，它們在處理涉及自由形式輸出的複雜推理任務時尤為受青睞。在此範式中，通常會提示LLM將候選答案與真實參考答案進行比較，並分配一個二元獎勵以指示正確性。儘管這項比較任務看似簡單，我們發現生成式獎勵模型對表面操作展現出驚人的脆弱性：非文字符號（如“:”或“.”）或推理開場白如“思考過程：”和“讓我們一步步解決這個問題。”往往會導致錯誤的正向獎勵。我們證明，這一弱點在LLMs、數據集及提示格式中普遍存在，對依賴生成式獎勵模型的核心算法範式，如拒絕採樣、偏好優化和RLVR，構成了嚴重威脅。為緩解此問題，我們引入了一種簡單而有效的數據增強策略，並訓練了一個顯著提升魯棒性的新型生成式獎勵模型。我們的研究結果強調了開發更可靠的基於LLM的評估方法的迫切需求。我們在https://huggingface.co/sarosavo/Master-RM和https://huggingface.co/datasets/sarosavo/Master-RM上發布了我們魯棒、通用領域的獎勵模型及其合成訓練數據。

English

Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

一令愚弄大模型即法官

One Token to Fool LLM-as-a-Judge

摘要

Support