Large Language Model Confidence Estimation via Black-Box Access

Abstract

Estimating the uncertainty or confidence in a model's responses is important for evaluating trust not only in those responses but also in the model as a whole. In this paper, we explore the problem of estimating confidence in the responses of large language models (LLMs) given only black-box (query) access to them. We propose a simple and extensible framework in which we engineer novel features and train an interpretable model (viz. logistic regression) on these features to estimate confidence. We empirically demonstrate that this framework is effective in estimating the confidence of flan-ul2, llama-13b and mistral-7b, consistently outperforming existing black-box confidence estimation approaches on benchmark datasets such as TriviaQA, SQuAD, CoQA and Natural Questions, in some cases by over 10% (on AUROC). Additionally, our interpretable approach provides insight into which features are predictive of confidence, leading to the interesting and useful discovery that confidence models built for one LLM generalize zero-shot to others on a given dataset.
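To make the shape of such a pipeline concrete, the following is a minimal sketch, not the paper's implementation. It assumes two hypothetical agreement features computed from re-queried responses (the abstract does not specify the actual features), fits a small hand-written logistic regression on them, and scores the result with AUROC. The toy data, the function names, and the training settings are all illustrative assumptions.

```python
import math

def agreement_features(answer, resampled):
    """Hypothetical black-box features (not from the paper): how consistent
    `answer` is with responses obtained by querying the LLM again."""
    norm = [r.strip().lower() for r in resampled]
    target = answer.strip().lower()
    exact_match_rate = sum(r == target for r in norm) / len(norm)
    distinct_rate = len(set(norm)) / len(norm)
    return [exact_match_rate, distinct_rate]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def confidence(w, b, x):
    """Predicted probability that the response with features `x` is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Plain batch gradient descent on the logistic loss (no regularization)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = confidence(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auroc(scores, labels):
    """Probability that a correct response outscores an incorrect one
    (ties count as half), computed over all positive/negative pairs."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up toy data: (main answer, re-queried answers, 1 if the answer was correct).
toy = [
    ("Paris", ["Paris", "paris", "Paris"], 1),
    ("1969", ["1969", "1969", "1968"], 1),
    ("Everest", ["Everest", "Everest", "Everest"], 1),
    ("Mars", ["Venus", "Mars", "Jupiter"], 0),
    ("blue", ["red", "green", "teal"], 0),
    ("7", ["9", "7", "8"], 0),
]
X = [agreement_features(answer, samples) for answer, samples, _ in toy]
y = [label for _, _, label in toy]

w, b = train_logistic_regression(X, y)
scores = [confidence(w, b, x) for x in X]
# Scored on the training examples only to keep the sketch short;
# a real evaluation would use held-out data.
auc = auroc(scores, y)
print("weights:", w, "bias:", b, "AUROC:", auc)
```

Because the model is linear, the learned weights can be read directly as an indication of which features push the confidence up or down, which is the kind of interpretability the abstract refers to. In practice a library implementation such as scikit-learn's `LogisticRegression` would likely replace the hand-written training loop.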

