Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

Abstract

Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics such as AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions, ranging from lexical-based and embedding-based metrics to LLM-as-a-judge approaches, across 4 datasets × 4 models × 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence a potential solution to mitigate these biases.
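
The evaluation protocol described above can be illustrated with a minimal sketch. It is not the paper's implementation: the toy responses, the 0.5 correctness threshold, and the helper names (`auroc`, `negative_sequence_log_prob`) are all assumptions made for illustration, and the UQ score is computed in log space, which ranks responses the same way as the negative sequence probability. The sketch scores each response with a UQ method, binarizes a correctness score, and reports the AUROC of the uncertainty score for detecting incorrect responses.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def negative_sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """UQ score: negative sum of token log-probabilities.

    Every extra token adds a non-positive term to the sum, so this score
    tends to grow with response length, which is one source of length bias.
    """
    return -sum(token_log_probs)


# Hypothetical toy data: token log-probabilities per response and a
# correctness score in [0, 1] (standing in for something like ROUGE-L).
responses = [
    {"token_log_probs": [-0.1, -0.2], "correctness": 0.9},
    {"token_log_probs": [-0.3, -0.1, -0.2], "correctness": 0.8},
    {"token_log_probs": [-0.9, -1.1, -0.7, -0.8], "correctness": 0.2},
    {"token_log_probs": [-0.5, -0.6, -0.9, -1.0, -0.4], "correctness": 0.1},
]
THRESHOLD = 0.5  # assumed cut-off for calling a response correct

uncertainty = [negative_sequence_log_prob(r["token_log_probs"]) for r in responses]
is_incorrect = [int(r["correctness"] < THRESHOLD) for r in responses]

# Higher uncertainty should flag incorrect responses, so "incorrect" is the
# positive class.
evaluation_auroc = auroc(uncertainty, is_incorrect)
print(f"AUROC of the UQ score for detecting errors: {evaluation_auroc:.2f}")
```

In this toy setup the longer responses are also the ones scored as incorrect, so a length-sensitive UQ score separates them perfectly. If the correctness function itself tends to penalize long responses, the same pattern can appear without the UQ method tracking true correctness, which is the kind of spurious interaction the abstract describes.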
