Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation
July 15, 2024
Authors: Tu Vu, Kalpesh Krishna, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, Yun-Hsuan Sung
cs.AI
Abstract
As large language models (LLMs) advance, it becomes more challenging to
reliably evaluate their output due to the high costs of human evaluation. To
make progress towards better LLM autoraters, we introduce FLAMe, a family of
Foundational Large Autorater Models. FLAMe is trained on our large and diverse
collection of 100+ quality assessment tasks comprising 5M+ human judgments,
curated and standardized using publicly released human evaluations from
previous research. FLAMe significantly improves generalization to a wide
variety of held-out tasks, outperforming LLMs trained on proprietary data like
GPT-4 and Claude-3 on many tasks. We show that FLAMe can also serve as a
powerful starting point for further downstream fine-tuning, using reward
modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our
FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative
model trained exclusively on permissively licensed data, outperforming both
GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more
computationally efficient approach using a novel tail-patch fine-tuning
strategy to optimize our FLAMe multitask mixture for reward modeling evaluation
(FLAMe-Opt-RM), offering competitive RewardBench performance while requiring
approximately 25x fewer training datapoints. Overall, our FLAMe variants
outperform all popular proprietary LLM-as-a-Judge models we consider across 8
out of 12 autorater evaluation benchmarks, encompassing 53 quality assessment
tasks, including RewardBench and LLM-AggreFact. Finally, our analysis reveals
that FLAMe is significantly less biased than these LLM-as-a-Judge models on the
CoBBLEr autorater bias benchmark, while effectively identifying high-quality
responses for code generation.
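For context, the RewardBench result cited above is pairwise preference accuracy: the judge is shown a prompt with two candidate responses and is scored on how often it prefers the human-chosen one. The sketch below illustrates that evaluation loop with a hypothetical judge callable standing in for any generative autorater (a FLAMe variant or an LLM-as-a-Judge); it is an assumed interface for illustration, not the released FLAMe API.

# Minimal sketch of RewardBench-style pairwise evaluation with a generative
# autorater. The `judge` argument is a placeholder for any LLM-as-a-Judge;
# the prompt template and field names are illustrative assumptions.
from typing import Callable

JUDGE_PROMPT = (
    "You are an impartial evaluator. Given a prompt and two responses, "
    "answer 'A' if Response A is better or 'B' if Response B is better.\n\n"
    "Prompt: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}\n\nAnswer:"
)

def pairwise_accuracy(examples, judge: Callable[[str], str]) -> float:
    """Fraction of pairs where the judge prefers the human-chosen response.

    Each example is a dict with 'prompt', 'chosen', and 'rejected' fields,
    mirroring the RewardBench pairwise format.
    """
    correct = 0
    for ex in examples:
        # The human-preferred response is shown as "A" here; a real evaluation
        # would randomize the A/B order to control for position bias.
        verdict = judge(JUDGE_PROMPT.format(
            prompt=ex["prompt"], a=ex["chosen"], b=ex["rejected"]))
        correct += verdict.strip().upper().startswith("A")
    return correct / len(examples)

The reported 87.8% for FLAMe-RM-24B corresponds to this kind of accuracy aggregated over the benchmark's preference pairs.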