Large Language Model Confidence Estimation via Black-Box Access

Abstract

Estimating uncertainty or confidence in a model's responses can be important for evaluating trust not only in those responses but also in the model as a whole. In this paper, we explore the problem of estimating confidence in the responses of large language models (LLMs) given only black-box, or query, access to them. We propose a simple and extensible framework in which we engineer novel features and train an interpretable model (viz., logistic regression) on these features to estimate the confidence. We empirically demonstrate that this simple framework is effective in estimating the confidence of flan-ul2, llama-13b, and mistral-7b: it consistently outperforms existing black-box confidence estimation approaches on benchmark datasets such as TriviaQA, SQuAD, CoQA, and Natural Questions, in some cases by more than 10% in AUROC. Additionally, our interpretable approach provides insight into which features are predictive of confidence, leading to the interesting and useful discovery that confidence models built for one LLM generalize zero-shot to others on a given dataset.
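
To make the shape of such a pipeline concrete, here is a minimal sketch: per-response features go in, a logistic regression produces a confidence score, and AUROC measures how well that score separates correct from incorrect responses. The two features and the synthetic data below are hypothetical placeholders, since the abstract does not specify the engineered features, and the hand-written gradient-descent trainer merely stands in for any standard logistic-regression implementation. It should be read as an illustration under these assumptions, not as the authors' implementation.

```python
import math
import random


def predict_confidence(w, b, x):
    """Estimated probability that a response with features x is correct."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_confidence(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Chance that a random correct response outscores a random incorrect one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic_data(n, rng):
    """ASSUMPTION: two made-up features in [0, 1]; label 1 = response correct."""
    X, y = [], []
    for _ in range(n):
        correct = 1 if rng.random() < 0.5 else 0
        f1 = min(1.0, max(0.0, rng.gauss(0.7 if correct else 0.4, 0.15)))
        f2 = min(1.0, max(0.0, rng.gauss(0.6 if correct else 0.45, 0.2)))
        X.append([f1, f2])
        y.append(correct)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic_data(400, rng)
X_test, y_test = make_synthetic_data(200, rng)

w, b = train_logistic_regression(X_train, y_train)
test_scores = [predict_confidence(w, b, x) for x in X_test]
test_auroc = auroc(test_scores, y_test)

# The learned weights are directly inspectable, which is the sense in which
# a logistic-regression confidence model is interpretable.
print(f"weights={w}, bias={b:.3f}, test AUROC={test_auroc:.3f}")
```

In a real setting, the synthetic features would be replaced by quantities computed purely from query access to the LLM, and the labels by whether each response matched the reference answer.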
