JudgeLRM: Large Reasoning Models as a Judge

March 31, 2025
Authors: Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He
cs.AI

Abstract

The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing approaches that apply Supervised Fine-Tuning (SFT) to judge models often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks that require deep reasoning.
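
The abstract describes training judge models with RL using judge-wise, outcome-driven rewards, i.e., rewards computed from the correctness of the judge's final verdict rather than from its reasoning trace. The sketch below illustrates one way such a reward could be computed for a pairwise comparison; the output format, tag names, and reward weights are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' implementation) of an outcome-driven reward
# for a pairwise LLM judge trained with RL. The judge's completion is assumed
# to end with two scores, e.g. "<answer>7 4</answer>"; the reward depends only
# on whether the implied preference matches the ground-truth label, plus a
# small format bonus. Tag names and weights here are assumptions.
import re
from typing import Optional, Tuple

def parse_scores(completion: str) -> Optional[Tuple[float, float]]:
    """Extract the two judge scores from an <answer>...</answer> span."""
    match = re.search(
        r"<answer>\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*</answer>", completion
    )
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

def judge_reward(completion: str, preferred: str) -> float:
    """Outcome-driven reward: credit is given only for the final verdict.

    `preferred` is "A" or "B", the ground-truth better answer.
    A small bonus rewards emitting well-formed scores at all.
    """
    scores = parse_scores(completion)
    if scores is None:
        return -1.0                      # unparsable output is penalized
    score_a, score_b = scores
    reward = 0.1                         # format bonus (assumed weight)
    predicted = "A" if score_a > score_b else "B" if score_b > score_a else None
    if predicted == preferred:
        reward += 1.0                    # correct final verdict
    return reward

# Usage: score a single rollout during RL-style training of the judge.
print(judge_reward("Reasoning... <answer>8 5</answer>", preferred="A"))  # 1.1
```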