Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

Abstract

Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics such as AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions, ranging from lexical-based and embedding-based metrics to LLM-as-a-judge approaches, across 4 datasets × 4 models × 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in the UQ methods themselves. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence as a potential way to mitigate these biases.
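The interaction described in the abstract can be illustrated with a small synthetic sketch. It is not the paper's experimental setup: the length range, noise level, and flip probability below are made-up assumptions, and `auroc` and `simulate` are hypothetical helper names. The uncertainty score depends only on response length and carries no information about true correctness. The correctness function wrongly marks long responses as incorrect half of the time. AUROC is computed as the probability that an incorrect response receives a higher uncertainty score than a correct one.

```python
import random


def auroc(scores, positives):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    pairs = sorted(zip(scores, positives))
    n_pos = sum(1 for p in positives if p)
    n_neg = len(positives) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        # Find the block of tied scores and give each the average rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of 1-indexed ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for _, p in pairs[i:j] if p)
        i = j
    # Mann-Whitney U statistic, normalised to [0, 1].
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate(n=4000, seed=0):
    """Return (AUROC vs. true errors, AUROC vs. length-biased error labels)."""
    rng = random.Random(seed)
    uncertainty, truly_wrong, judged_wrong = [], [], []
    for _ in range(n):
        length = rng.randint(1, 50)            # assumed response length
        wrong = rng.random() < 0.5             # true error, independent of length
        # Assumed UQ score: grows with length, ignores correctness entirely.
        score = 0.1 * length + rng.gauss(0.0, 0.5)
        # Assumed biased correctness function: long correct answers are
        # mislabelled as incorrect with probability 0.5.
        flagged = wrong or (length > 25 and rng.random() < 0.5)
        uncertainty.append(score)
        truly_wrong.append(wrong)
        judged_wrong.append(flagged)
    return auroc(uncertainty, truly_wrong), auroc(uncertainty, judged_wrong)


if __name__ == "__main__":
    true_auc, biased_auc = simulate()
    print(f"AUROC against true correctness:   {true_auc:.3f}")
    print(f"AUROC against biased correctness: {biased_auc:.3f}")
```

Under these assumptions, the score sits near chance against the true labels but looks clearly better than chance against the length-biased labels, even though it never used any information about correctness. This is the kind of spurious gain the abstract attributes to shared length bias.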
