Calibrating LLM-Based Evaluator
September 23, 2023
Authors: Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
cs.AI
Abstract
Recent advances in the language modeling and emergent capabilities of large
language models (LLMs) make them a promising reference-free evaluator of
natural language generation quality and a competent alternative to human
evaluation. However, because such models are often closed-source or
computationally expensive to host and tune, there is little established
practice for further calibrating an off-the-shelf LLM-based evaluator toward
better human alignment. In this work, we propose
AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate
and align an LLM-based evaluator toward human preference. Instead of explicitly
modeling human preferences, we first implicitly encompass them within a set of
human labels. Then, an initial set of scoring criteria is drafted by the
language model itself, leveraging in-context learning on different few-shot
examples. To further calibrate this set of criteria, we select the best
performers and re-draft them with self-refinement. Our experiments on multiple
text quality evaluation datasets illustrate a significant improvement in
correlation with expert evaluation through calibration. Our comprehensive
qualitative analysis conveys insightful intuitions and observations on the
essence of effective scoring criteria.
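
Since the abstract only describes the pipeline at a high level, the following is a minimal Python sketch of how such a multi-stage, gradient-free calibration loop could look. Here `complete(prompt)` is a hypothetical placeholder for whatever LLM completion API is available, and the prompt wording, candidate counts, and Spearman-based selection heuristic are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of a multi-stage, gradient-free calibration loop in the
# spirit of AutoCalibrate. `complete(prompt)` is a hypothetical placeholder
# for any LLM completion API; prompts and heuristics are illustrative only.

import random
from scipy.stats import spearmanr


def complete(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real completion API."""
    raise NotImplementedError


def draft_criteria(labeled_examples, n_candidates=8, shots_per_draft=4):
    """Stage 1: draft candidate scoring criteria by in-context learning
    on different few-shot subsets of human-labeled examples."""
    candidates = []
    for _ in range(n_candidates):
        shots = random.sample(labeled_examples, shots_per_draft)
        demo = "\n".join(
            f"Text: {text}\nHuman score: {score}" for text, score in shots
        )
        prompt = (
            "Based on the scored examples below, draft scoring criteria "
            "(a rubric) explaining how these scores were assigned.\n\n"
            f"{demo}\n\nScoring criteria:"
        )
        candidates.append(complete(prompt))
    return candidates


def score_with_criteria(criteria, text):
    """Ask the LLM to grade a text under a given rubric (1-5 scale assumed)."""
    prompt = (
        f"Scoring criteria:\n{criteria}\n\n"
        f"Text: {text}\n"
        "Score the text from 1 to 5 according to the criteria. "
        "Answer with a single number:"
    )
    return float(complete(prompt).strip())


def select_best(candidates, labeled_examples, keep=2):
    """Stage 2: keep the criteria whose LLM scores correlate best with humans."""
    ranked = []
    human_scores = [score for _, score in labeled_examples]
    for criteria in candidates:
        llm_scores = [score_with_criteria(criteria, t) for t, _ in labeled_examples]
        rho, _ = spearmanr(llm_scores, human_scores)
        ranked.append((rho, criteria))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [criteria for _, criteria in ranked[:keep]]


def refine(criteria):
    """Stage 3: re-draft the best-performing criteria via self-refinement."""
    prompt = (
        f"Here is a draft scoring rubric:\n{criteria}\n\n"
        "Rewrite it to be clearer, more specific, and easier to apply consistently:"
    )
    return complete(prompt)


def autocalibrate(labeled_examples):
    """Full loop: draft candidates, select the best, then self-refine them."""
    candidates = draft_criteria(labeled_examples)
    best = select_best(candidates, labeled_examples)
    return [refine(c) for c in best]
```

In this sketch the human labels are never modeled explicitly; they only enter as few-shot demonstrations in Stage 1 and as the correlation target in Stage 2, which mirrors the gradient-free, label-implicit framing described in the abstract.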