One Token to Fool LLM-as-a-Judge

Abstract


Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
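To make the failure mode concrete, the following is a minimal sketch of how a false-positive rate could be measured for a judge. It is an illustration under stated assumptions, not the authors' evaluation code: the prompt wording in `JUDGE_TEMPLATE`, the helper names, and the toy `gullible_judge` (a stand-in for a real LLM call) are all hypothetical. Only the four example strings in `MASTER_KEYS` come from the abstract.

```python
from typing import Callable, Iterable, Tuple

# Example strings quoted in the abstract; the list name is our own.
MASTER_KEYS = [":", ".", "Thought process:", "Let's solve this problem step by step."]

# Hypothetical prompt wording, NOT the template used by the authors.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate correct? Reply YES or NO."
)

# A judge maps a prompt to its raw text reply (in practice, an LLM call).
Judge = Callable[[str], str]


def binary_reward(judge: Judge, question: str, reference: str, candidate: str) -> int:
    """Return 1 if the judge's reply starts with YES, otherwise 0."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )
    return int(judge(prompt).strip().upper().startswith("YES"))


def false_positive_rate(
    judge: Judge, items: Iterable[Tuple[str, str]], candidate: str
) -> float:
    """Fraction of (question, reference) pairs where a content-free candidate gets reward 1."""
    pairs = list(items)
    if not pairs:
        return 0.0
    hits = sum(binary_reward(judge, q, ref, candidate) for q, ref in pairs)
    return hits / len(pairs)


def gullible_judge(prompt: str) -> str:
    """Toy stand-in for an LLM: accepts anything that looks like a reasoning opener."""
    candidate = prompt.split("Candidate answer: ", 1)[1].split("\n", 1)[0]
    return "YES" if candidate.endswith(":") or "step by step" in candidate else "NO"


if __name__ == "__main__":
    data = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
    for key in MASTER_KEYS:
        print(repr(key), false_positive_rate(gullible_judge, data, key))
```

Since none of the candidate strings contains an answer, any nonzero rate is a false positive by construction. Replacing `gullible_judge` with a function that queries an actual model would let the same loop probe a real judge.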
