When an LLM is apprehensive about its answers -- and when its uncertainty is justified
March 3, 2025
Authors: Petr Sychev, Andrey Goncharov, Daniil Vyazhev, Edvard Khalafyan, Alexey Zaytsev
cs.AI
Abstract
Uncertainty estimation is crucial for evaluating Large Language Models
(LLMs), particularly in high-stakes domains where incorrect answers result in
significant consequences. Numerous approaches address this problem while
focusing on one specific type of uncertainty and ignoring the others. We investigate
what estimates, specifically token-wise entropy and model-as-judge (MASJ),
would work for multiple-choice question-answering tasks for different question
topics. Our experiments cover three LLMs, Phi-4, Mistral, and Qwen, at sizes
ranging from 1.5B to 72B parameters, across 14 topics. While MASJ performs similarly
to a random error predictor, the response entropy predicts model error in
knowledge-dependent domains and serves as an effective indicator of question
difficulty: for biology, the ROC AUC is 0.73. This correlation vanishes in the
reasoning-dependent domain: for math questions, the ROC AUC is 0.55. More
fundamentally, we find that the entropy measure depends on the amount of reasoning
required. Thus, entropy related to data uncertainty should be integrated into
uncertainty-estimation frameworks, while MASJ requires refinement. Moreover, existing
MMLU-Pro samples are biased; the amount of reasoning required should be balanced
across subdomains to provide a fairer assessment of LLM performance.
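The two quantities the abstract relies on, token-wise response entropy and ROC AUC of an error predictor, can be sketched as follows. This is a minimal illustration under assumed inputs (per-token probability distributions and invented toy scores), not the authors' implementation:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def response_entropy(token_dists):
    """Mean token-wise entropy over all tokens of a generated answer."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def roc_auc(scores, labels):
    """ROC AUC as P(score of an erroneous answer > score of a correct one),
    counting ties as 0.5; labels: 1 = model error, 0 = correct answer."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (invented numbers): when every erroneous answer has higher
# entropy than every correct one, entropy is a perfect error predictor.
entropies = [1.2, 0.3, 0.9, 0.1]   # one mean entropy per answered question
errors = [1, 0, 1, 0]              # 1 = the model answered incorrectly
print(roc_auc(entropies, errors))  # -> 1.0
```

An AUC near 0.5, as reported for math questions, means entropy ranks erroneous and correct answers no better than chance; values toward 1.0, as for biology, indicate entropy separates them well.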