JudgeLRM: Large Reasoning Models as a Judge

Abstract

The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, which highlights the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, with the largest gains on judge tasks requiring deep reasoning.
