不可验证场景下LLM训练后评估中的推理能力检验

摘要

推理型大语言模型即评判器（LLMs-as-Judges）能够通过推理时扩展获益，为将推理模型的成功经验推广至不可验证领域（即输出正确性/质量无法直接核实的场景）提供了可行路径。然而，尽管推理型评判器在静态评估基准中表现优异，但其在实际策略训练中的有效性尚未得到系统检验。为此，我们通过严格实验探究非推理型与推理型评判器在基于强化学习的大模型对齐中的实际影响。在受控合成场景下，我们利用"黄金标准"评判器（gpt-oss-120b）提供的偏好标注训练小型评判器，揭示了二者的关键差异：非推理型评判器易导致奖励破解，而推理型评判器训练出的策略能在黄金标准评判器评估中取得强劲表现。有趣的是，我们发现推理型评判器训练的策略之所以表现优异，是因为其学会了生成高效对抗性输出——这些输出不仅能欺骗其他LLM评判器，还能在Arena-Hard等流行基准测试中获得高分。结合进一步分析，本研究既揭示了（推理型）LLM评判器在不可验证领域后训练应用中的重要发现，也指出了其改进空间。

English

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, where a "gold-standard" judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges lead to reward hacking easily, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve such strong performance by learning to generate highly effective adversarial outputs that can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM-judges. Combined with our further analysis, our study highlights both important findings and room for improvements for applying (reasoning) LLM-judges in non-verifiable LLM post-training.

不可验证场景下LLM训练后评估中的推理能力检验

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

摘要

Support