
Calibrating LLM-Based Evaluator

September 23, 2023
作者: Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
cs.AI

Abstract

Recent advancements in large language models (LLMs) in language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality, and a competent alternative to human evaluation. However, hindered by closed-source access or the high computational demand of hosting and tuning, there is little established practice for further calibrating an off-the-shelf LLM-based evaluator toward better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. The language model itself then drafts an initial set of scoring criteria, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text-quality evaluation datasets show a significant improvement in correlation with expert evaluation after calibration. Our comprehensive qualitative analysis conveys insightful intuitions and observations on the essence of effective scoring criteria.
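The multi-stage pipeline described in the abstract (draft criteria via in-context learning, select the best performers by correlation with human labels, then self-refine) can be sketched roughly as follows. This is a minimal illustrative skeleton, not the authors' implementation: the `draft_criteria`, `refine_criteria`, and `score_with` callables stand in for LLM prompts, and the hyperparameters (`n_drafts`, `top_k`) are hypothetical.

```python
# Hypothetical sketch of an AutoCalibrate-style loop; LLM calls are stubbed
# out as plain callables supplied by the caller.
import random


def pearson(xs, ys):
    """Pearson correlation, used here as the criteria-selection metric."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0


def autocalibrate(draft_criteria, refine_criteria, score_with,
                  samples, human_labels, n_drafts=4, top_k=2):
    # Stage 1: draft candidate scoring criteria by in-context learning
    # on different few-shot subsets of the human-labeled samples.
    candidates = [
        draft_criteria(random.sample(samples, k=min(3, len(samples))))
        for _ in range(n_drafts)
    ]
    # Stage 2: keep the criteria whose LLM-assigned scores correlate
    # best with the human labels.
    ranked = sorted(
        candidates,
        key=lambda c: pearson([score_with(c, s) for s in samples], human_labels),
        reverse=True,
    )
    # Stage 3: re-draft the best performers with self-refinement.
    return [refine_criteria(c) for c in ranked[:top_k]]
```

In a real setting, `draft_criteria` would prompt the LLM with few-shot labeled examples, `score_with` would ask it to grade a sample under a given rubric, and `refine_criteria` would prompt it to rewrite a rubric given its own output; the whole loop stays gradient-free because only prompts, never model weights, are updated.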