Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Abstract

The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we ask: do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.
