Large Language Model Confidence Estimation via Black-Box Access
June 1, 2024
Authors: Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri
cs.AI
Abstract
Estimating uncertainty or confidence in the responses of a model can be
significant in evaluating trust not only in the responses, but also in the
model as a whole. In this paper, we explore the problem of estimating
confidence for responses of large language models (LLMs) with simply black-box
or query access to them. We propose a simple and extensible framework where we
engineer novel features and train an (interpretable) model (viz. logistic
regression) on these features to estimate the confidence. We empirically
demonstrate that our simple framework is effective in estimating the confidence of
flan-ul2, llama-13b, and mistral-7b, consistently outperforming existing
black-box confidence estimation approaches on benchmark datasets such as
TriviaQA, SQuAD, CoQA, and Natural Questions, by more than 10% (in AUROC) in
some cases. Additionally, our interpretable approach provides insight into
features that are predictive of confidence, leading to the interesting and
useful discovery that our confidence models built for one LLM generalize
zero-shot across others on a given dataset.
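As a rough illustration of the framework the abstract describes, the sketch below computes a few black-box features from sampled LLM responses, fits a logistic regression on them, and scores it with AUROC. The specific features (response-length statistics and agreement across resampled answers) and the toy data are illustrative assumptions; the abstract does not enumerate the paper's actual engineered features.

```python
# Minimal sketch of black-box confidence estimation: engineered features from
# query-only access to an LLM, plus an interpretable logistic regression.
# The features below are illustrative assumptions, not the paper's feature set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def black_box_features(responses):
    """Features from k sampled responses of a black-box LLM to one question."""
    lengths = [len(r.split()) for r in responses]
    most_common = max(set(responses), key=responses.count)
    agreement = responses.count(most_common) / len(responses)  # self-consistency
    return [float(np.mean(lengths)), float(np.std(lengths)), agreement]

# Toy stand-in data: X holds one feature row per question, y marks whether the
# LLM's answer was judged correct. In practice X would come from applying
# black_box_features to real model outputs on a benchmark like TriviaQA.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 2] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X[:150], y[:150])
confidence = clf.predict_proba(X[150:])[:, 1]        # estimated confidence per response
print("AUROC:", roc_auc_score(y[150:], confidence))  # metric used in the abstract
print("feature weights:", clf.coef_)                 # interpretable: which features matter
```

The same fitted classifier could then be scored on features computed from a different LLM's responses on the same dataset, which corresponds to the zero-shot cross-model generalization the abstract reports.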