Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking
September 23, 2024
Authors: Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson
cs.AI
Abstract
The release of ChatGPT in November 2022 sparked an explosion of interest in
post-training and an avalanche of new preference optimization (PO) methods.
These methods claim superior alignment by virtue of better correspondence with
human pairwise preferences, often measured by LLM judges. In this work, we
attempt to answer the following question -- do LLM-judge preferences translate
to progress on other, more concrete metrics for alignment, and if not, why not?
We define a concrete metric for alignment, and introduce SOS-Bench, the largest
standardized, reproducible LLM meta-benchmark to date. We find that (1)
LLM-judgments do not correlate with concrete measures of safety, world
knowledge, and instruction following; (2) LLM judges have powerful implicit
biases, prioritizing style over factuality and safety; and (3) the supervised
fine-tuning (SFT) stage of post-training, and not the PO stage, has the
greatest impact on alignment, with data scaling and prompt diversity as the
driving factors. Our codebase and complete results can be found at
https://github.com/penfever/sos-bench.
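As an illustration of the kind of check the abstract describes (not code from the SOS-Bench repository), the sketch below compares hypothetical LLM-judge win rates against a concrete benchmark metric using Spearman rank correlation; a coefficient near zero would reflect the finding that judge preferences do not track concrete measures of safety, world knowledge, or instruction following. All model names and scores are made-up placeholders.

```python
# Illustrative sketch only: do LLM-judge preferences track a concrete alignment metric?
from scipy.stats import spearmanr

# Hypothetical post-trained model checkpoints and their scores.
models = ["model_a", "model_b", "model_c", "model_d"]
judge_win_rate = [0.72, 0.65, 0.58, 0.41]  # pairwise-preference win rate from an LLM judge
safety_score = [0.55, 0.80, 0.62, 0.77]    # score on a concrete safety benchmark

# Rank correlation between the two orderings of the models.
rho, p_value = spearmanr(judge_win_rate, safety_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A rho near zero (or negative) indicates the judge's ranking is unrelated to the concrete metric.
```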