Automatic Calibration and Error Correction for Large Language Models via Pareto Optimal Self-Supervision
June 28, 2023
Authors: Theodore Zhao, Mu Wei, J. Samuel Preston, Hoifung Poon
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities out of
the box for a wide range of applications, yet accuracy remains a major growth
area, especially in mission-critical domains such as biomedicine. An effective
method to calibrate the confidence level of LLM responses is essential to
automatically detect errors and facilitate human-in-the-loop verification. An
important source of calibration signals stems from expert-stipulated
programmatic supervision, which is often available at low cost but has its own
limitations, such as noise and limited coverage. In this paper, we introduce a Pareto
optimal self-supervision framework that can leverage available programmatic
supervision to systematically calibrate LLM responses by producing a risk score
for every response, without any additional manual effort. This is accomplished
by learning a harmonizer model that aligns LLM output with other available
supervision sources; the harmonizer assigns higher risk scores to more uncertain
LLM responses, facilitating error correction. Experiments on standard relation
extraction tasks in biomedical and general domains demonstrate the promise of
this approach, with our proposed risk scores highly correlated with the real
error rate of LLMs. For the most uncertain test instances, dynamic prompting
based on our proposed risk scores results in significant accuracy improvement
for off-the-shelf LLMs, boosting GPT-3 past state-of-the-art (SOTA)
weak-supervision results and GPT-4 past SOTA supervised results on challenging
evaluation datasets.
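
As a rough illustration of the core idea, the sketch below trains a simple logistic-regression harmonizer against several supervision sources at once. This is a minimal sketch, not the paper's implementation: the Pareto optimal trade-off is approximated by a fixed convex weighting of per-source losses, and the feature matrix `X`, the LLM's labels, and the labeling-function outputs are assumed to be given.

```python
# Minimal sketch (not the paper's implementation) of a harmonizer fit
# against multiple supervision sources. Assumed inputs: feature matrix X,
# integer label arrays from each source (the LLM plus labeling functions),
# with -1 marking abstentions.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_harmonizer(X, sources, weights, n_classes, lr=0.1, epochs=200):
    """Logistic-regression harmonizer minimizing a weighted sum of
    per-source cross-entropy losses (a fixed convex weighting, used here
    as a stand-in for the paper's Pareto optimal trade-off)."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    for _ in range(epochs):
        P = softmax(X @ W)                      # harmonizer probabilities
        grad = np.zeros_like(W)
        for w, y in zip(weights, sources):
            mask = y >= 0                       # skip abstentions
            if not mask.any():
                continue
            T = np.zeros((n, n_classes))        # one-hot targets
            T[np.arange(n)[mask], y[mask]] = 1.0
            grad += w * X[mask].T @ (P[mask] - T[mask]) / mask.sum()
        W -= lr * grad
    return W

def risk_scores(X, W, llm_labels):
    """Risk of each LLM response: 1 minus the harmonizer's probability
    for the label the LLM itself produced."""
    P = softmax(X @ W)
    return 1.0 - P[np.arange(len(llm_labels)), llm_labels]
```

Note that the paper searches for a genuinely Pareto optimal harmonizer across the per-source objectives; fixing `weights` up front, as done here, only stands in for that search.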
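
The error-correction step can then be gated on the risk score, re-querying the LLM only for its most uncertain responses. In the sketch below, `query_llm` is a hypothetical callable standing in for any completion API, and the re-prompt wording is illustrative rather than the paper's dynamic-prompting template.

```python
# Sketch of risk-gated error correction via re-prompting. `query_llm` is
# a hypothetical callable wrapping any completion API; the hint wording
# below is illustrative, not the paper's dynamic-prompting template.
def correct_with_reprompt(instances, risks, llm_labels, harmonizer_labels,
                          query_llm, threshold=0.5):
    corrected = list(llm_labels)
    for i, (text, risk) in enumerate(zip(instances, risks)):
        if risk < threshold:
            continue                  # keep confident LLM answers as-is
        prompt = (f"{text}\n\nA weaker auxiliary model suggests the answer "
                  f"may be '{harmonizer_labels[i]}'. Reconsider the question "
                  f"and give your final answer.")
        corrected[i] = query_llm(prompt)
    return corrected
```

The threshold of 0.5 is arbitrary; in practice it would be chosen from the risk-score distribution on held-out data to trade off correction coverage against API cost.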