

Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

April 18, 2025
作者: Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, Sinead Williamson
cs.AI

Abstract

Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions -- from lexical-based and embedding-based metrics to LLM-as-a-judge approaches -- across 4 datasets x 4 models x 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence a potential solution to mitigate these biases.
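To make the evaluation protocol concrete, below is a minimal sketch (not the paper's code) of scoring a single UQ method against a single correctness function with AUROC. It assumes the `rouge_score` and `scikit-learn` packages, toy responses with made-up token log-probabilities, negative sequence log-probability as the UQ score, and an illustrative 0.3 ROUGE-L threshold for binarizing correctness.

```python
# Sketch of the AUROC-based UQ evaluation described in the abstract.
# Toy data and the 0.3 threshold are illustrative assumptions, not from the paper.
from rouge_score import rouge_scorer
from sklearn.metrics import roc_auc_score

# Hypothetical model outputs: generated responses, their per-token log-probs,
# and gold reference answers.
responses = [
    "Paris is the capital of France.",
    "The answer is 42.",
    "I am not sure.",
]
token_logprobs = [
    [-0.1, -0.2, -0.1, -0.3, -0.2, -0.1],
    [-1.2, -0.9, -1.5, -2.0],
    [-2.5, -3.1, -2.8, -3.0],
]
references = [
    "Paris is the capital of France.",
    "42",
    "Albert Einstein",
]

# Correctness function: ROUGE-L F1 against the reference, thresholded
# to a binary label (1 = correct).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
correct = [
    int(scorer.score(ref, resp)["rougeL"].fmeasure > 0.3)
    for ref, resp in zip(references, responses)
]

# UQ method: negative sequence log-probability (higher = more uncertain).
uncertainty = [-sum(lps) for lps in token_logprobs]

# AUROC of confidence (negated uncertainty) against binary correctness;
# 1.0 means the UQ score perfectly separates correct from incorrect answers.
auroc = roc_auc_score(correct, [-u for u in uncertainty])
print(f"AUROC: {auroc:.3f}")
```

The paper's point is that the `correct` labels produced by length-biased correctness functions (e.g., lexical overlap metrics) can interact with length-biased UQ scores such as the one above, inflating or deflating this AUROC independently of true answer quality.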

