Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation
July 15, 2024
Authors: Tu Vu, Kalpesh Krishna, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, Yun-Hsuan Sung
cs.AI
Abstract
As large language models (LLMs) advance, it becomes more challenging to
reliably evaluate their output due to the high costs of human evaluation. To
make progress towards better LLM autoraters, we introduce FLAMe, a family of
Foundational Large Autorater Models. FLAMe is trained on our large and diverse
collection of 100+ quality assessment tasks comprising 5M+ human judgments,
curated and standardized using publicly released human evaluations from
previous research. FLAMe significantly improves generalization to a wide
variety of held-out tasks, outperforming LLMs trained on proprietary data like
GPT-4 and Claude-3 on many tasks. We show that FLAMe can also serve as a
powerful starting point for further downstream fine-tuning, using reward
modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our
FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative
model trained exclusively on permissively licensed data, outperforming both
GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more
computationally efficient approach using a novel tail-patch fine-tuning
strategy to optimize our FLAMe multitask mixture for reward modeling evaluation
(FLAMe-Opt-RM), offering competitive RewardBench performance while requiring
approximately 25x fewer training datapoints. Overall, our FLAMe variants
outperform all popular proprietary LLM-as-a-Judge models we consider across 8
out of 12 autorater evaluation benchmarks, encompassing 53 quality assessment
tasks, including RewardBench and LLM-AggreFact. Finally, our analysis reveals
that FLAMe is significantly less biased than these LLM-as-a-Judge models on the
CoBBLEr autorater bias benchmark, while effectively identifying high-quality
responses for code generation.
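The abstract reports results rather than implementation details. As a rough illustration of the kind of standardized, text-to-text quality-assessment example an autorater like FLAMe could be trained or prompted on, here is a minimal Python sketch; the prompt template, field names, and the `render_pairwise_example` helper are illustrative assumptions, not the authors' actual data format.

```python
# Minimal sketch (not the authors' code): rendering one pairwise human-preference
# judgment into a text-to-text (input, target) example, the general shape of
# quality-assessment data described in the abstract. All names and the prompt
# template below are hypothetical.

def render_pairwise_example(instruction: str, response_a: str, response_b: str,
                            preferred: str) -> dict:
    """Convert a single pairwise judgment into an input/target training pair."""
    prompt = (
        "You are evaluating two responses to the same instruction.\n"
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is better? Answer with 'A' or 'B'."
    )
    return {"input": prompt, "target": preferred}


if __name__ == "__main__":
    example = render_pairwise_example(
        instruction="Summarize the benefits of unit testing in one sentence.",
        response_a="Unit tests catch regressions early and document intended behavior.",
        response_b="Testing is good.",
        preferred="A",  # the human judgment recorded for this pair
    )
    print(example["input"])
    print("Target:", example["target"])
```

In a multitask mixture of this kind, each of the 100+ tasks would contribute examples in its own instruction format, with the human judgment serving as the supervision target.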