Automatic Calibration and Error Correction for Large Language Models via Pareto Optimal Self-Supervision

June 28, 2023
Authors: Theodore Zhao, Mu Wei, J. Samuel Preston, Hoifung Poon
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities out of the box for a wide range of applications, yet accuracy remains a major growth area, especially in mission-critical domains such as biomedicine. An effective method to calibrate the confidence level of LLM responses is essential to automatically detect errors and facilitate human-in-the-loop verification. An important source of calibration signals stems from expert-stipulated programmatic supervision, which is often available at low cost but has its own limitations, such as noise and limited coverage. In this paper, we introduce a Pareto optimal self-supervision framework that can leverage available programmatic supervision to systematically calibrate LLM responses by producing a risk score for every response, without any additional manual effort. This is accomplished by learning a harmonizer model that aligns LLM output with other available supervision sources; the harmonizer assigns higher risk scores to more uncertain LLM responses and thereby facilitates error correction. Experiments on standard relation extraction tasks in the biomedical and general domains demonstrate the promise of this approach, with the proposed risk scores highly correlated with the real error rate of the LLMs. For the most uncertain test instances, dynamic prompting based on the proposed risk scores yields significant accuracy improvements for off-the-shelf LLMs, boosting GPT-3 past state-of-the-art (SOTA) weak-supervision results and GPT-4 past SOTA supervised results on challenging evaluation datasets.
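As a concrete, heavily simplified illustration of the workflow the abstract describes, the sketch below wires together a harmonizer-style model, a disagreement-based risk score, and risk-triggered dynamic prompting. It is not the paper's method: the Pareto optimal training objective is replaced here by a plain multi-source fit, and `query_llm`, the labeling-function format, and the TF-IDF plus logistic-regression model are all hypothetical stand-ins chosen for brevity.

```python
# Illustrative sketch only -- not the paper's released implementation.
# Assumed for illustration: query_llm() as a stand-in LLM call, programmatic
# labeling functions that vote or abstain on each instance, and a simple
# TF-IDF + logistic-regression "harmonizer" fit to agree with both the LLM
# and the weak supervision sources.
from typing import List, Sequence

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

ABSTAIN = -1  # a labeling function may decline to vote on an instance


def query_llm(text: str, exemplars: Sequence[str] = ()) -> int:
    """Hypothetical stand-in for an LLM relation-extraction call."""
    raise NotImplementedError


class Harmonizer:
    """Aligns LLM outputs with programmatic supervision and scores risk."""

    def __init__(self) -> None:
        self.vectorizer = TfidfVectorizer()
        self.model = LogisticRegression(max_iter=1000)

    def fit(self, texts: List[str], llm_labels: List[int],
            lf_votes: List[List[int]]) -> None:
        # Treat the LLM answer and every non-abstaining labeling-function
        # vote as (noisy) training targets, pulling the harmonizer toward
        # all available supervision sources at once.
        X, y = [], []
        for text, llm_y, votes in zip(texts, llm_labels, lf_votes):
            X.append(text)
            y.append(llm_y)
            for v in votes:
                if v != ABSTAIN:
                    X.append(text)
                    y.append(v)
        self.model.fit(self.vectorizer.fit_transform(X), y)

    def risk_score(self, text: str, llm_label: int) -> float:
        # Risk = 1 - harmonizer probability of the LLM's own answer, so the
        # score is high exactly when the other sources disagree with the LLM.
        probs = self.model.predict_proba(self.vectorizer.transform([text]))[0]
        classes = list(self.model.classes_)
        if llm_label not in classes:
            return 1.0
        return 1.0 - probs[classes.index(llm_label)]


def correct_if_risky(text: str, llm_label: int, harmonizer: Harmonizer,
                     exemplars: Sequence[str], threshold: float = 0.5) -> int:
    """Dynamic prompting: re-query with in-context exemplars only when risky."""
    if harmonizer.risk_score(text, llm_label) > threshold:
        return query_llm(text, exemplars=exemplars)
    return llm_label
```

The design choice mirrored here is that calibration comes for free once the harmonizer is trained: the risk score is just the harmonizer's disagreement with the LLM, and only the instances above a risk threshold pay the cost of a second, exemplar-augmented LLM call.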