
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

May 15, 2025
Authors: Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha
cs.AI

Abstract

The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.
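The abstract describes converting both verifiable and non-verifiable prompts into judgment tasks with verifiable rewards that also mitigate judgment bias (e.g., position bias in pairwise comparisons). As a rough illustration only, the sketch below shows one way such a reward could be computed for a pairwise judge: the verdict is checked against a known preferred response under both input orderings, so the reward is exact and position-independent. The judge callable, argument names, and the consistency gate are assumptions made for this sketch, not the paper's exact recipe.

# Hypothetical sketch of a verifiable reward for a pairwise judge.
# The pair (response_a, response_b) has a known preferred side ("a" or "b"),
# so the judge's verdict can be checked exactly; scoring both input orders
# is one simple way to discourage position bias.

def pairwise_judgment_reward(judge, prompt, response_a, response_b, preferred):
    """Return 1.0 only if the judge picks the preferred response in both orderings."""
    verdict_ab = judge(prompt, response_a, response_b)  # expected to return "a" or "b"
    verdict_ba = judge(prompt, response_b, response_a)  # same pair, responses swapped

    correct_ab = verdict_ab == preferred
    # After swapping, the preferred label flips: "a" becomes "b" and vice versa.
    flipped = "b" if preferred == "a" else "a"
    correct_ba = verdict_ba == flipped

    # Consistency-gated verifiable reward: credit only correct verdicts that
    # do not depend on the order in which the responses were presented.
    return 1.0 if (correct_ab and correct_ba) else 0.0

In training, a reward like this could be handed to a reinforcement-learning recipe so that the judge is rewarded for thinking its way to the correct, order-invariant verdict rather than for exploiting positional cues.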
