Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Abstract

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, offer a promising path for extending the success of reasoning models to non-verifiable domains, where the correctness or quality of an output cannot be checked directly. However, although reasoning judges perform better on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. We therefore conduct a rigorous study of the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, in which a "gold-standard" judge (gpt-oss-120b) provides preference annotations for training smaller judges, reveals key differences between the two: non-reasoning judges readily lead to reward hacking, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that reasoning-judge-trained policies achieve this strong performance by learning to generate highly effective adversarial outputs, which can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM judges. Combined with our further analysis, our study highlights both important findings and room for improvement in applying (reasoning) LLM judges to non-verifiable LLM post-training.

