When an LLM is apprehensive about its answers -- and when its uncertainty is justified
March 3, 2025
Authors: Petr Sychev, Andrey Goncharov, Daniil Vyazhev, Edvard Khalafyan, Alexey Zaytsev
cs.AI
Abstract
Uncertainty estimation is crucial for evaluating Large Language Models
(LLMs), particularly in high-stakes domains where incorrect answers result in
significant consequences. Numerous approaches address this problem while
focusing on one specific type of uncertainty and ignoring the others. We investigate
what estimates, specifically token-wise entropy and model-as-judge (MASJ),
would work for multiple-choice question-answering tasks for different question
topics. Our experiments cover three LLMs, Phi-4, Mistral, and Qwen, at sizes
ranging from 1.5B to 72B parameters, across 14 topics. While MASJ performs similarly
to a random error predictor, the response entropy predicts model error in
knowledge-dependent domains and serves as an effective indicator of question
difficulty: for biology, the ROC AUC is 0.73. This correlation vanishes in the
reasoning-dependent domain: for math questions, the ROC AUC is 0.55. More
fundamentally, we find that the entropy measure depends on the amount of reasoning
required. Thus, entropy related to data uncertainty should be integrated into
uncertainty-estimation frameworks, while MASJ requires refinement. Moreover, existing
MMLU-Pro samples are biased; the amount of reasoning required should be balanced
across subdomains to provide a fairer assessment of LLM performance.
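The two quantities the abstract relies on, token-wise response entropy and ROC AUC of an error predictor, can be sketched as follows. This is a minimal illustration under assumed inputs (per-token probability distributions and invented toy scores), not the authors' implementation:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def response_entropy(token_dists):
    """Mean token-wise entropy over all tokens of a generated answer."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def roc_auc(scores, labels):
    """ROC AUC as P(score of an erroneous answer > score of a correct one),
    counting ties as 0.5; labels: 1 = model error, 0 = correct answer."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (invented numbers): when every erroneous answer has higher
# entropy than every correct one, entropy is a perfect error predictor.
entropies = [1.2, 0.3, 0.9, 0.1]   # one mean entropy per answered question
errors = [1, 0, 1, 0]              # 1 = the model answered incorrectly
print(roc_auc(entropies, errors))  # -> 1.0
```

An AUC near 0.5, as reported for math questions, means entropy ranks erroneous and correct answers no better than chance; values toward 1.0, as for biology, indicate entropy separates them well.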