When an LLM is apprehensive about its answers -- and when its uncertainty is justified

Abstract

Uncertainty estimation is crucial for evaluating Large Language Models (LLMs), particularly in high-stakes domains where incorrect answers carry significant consequences. Numerous approaches address this problem, but most focus on one specific type of uncertainty and ignore the others. We investigate how well two estimates, token-wise entropy and model-as-judge (MASJ), work for multiple-choice question-answering tasks across different question topics. Our experiments cover three LLMs (Phi-4, Mistral, and Qwen) at sizes from 1.5B to 72B parameters, and 14 topics. While MASJ performs similarly to a random error predictor, response entropy predicts model error in knowledge-dependent domains and serves as an effective indicator of question difficulty: for biology, ROC AUC is 0.73. This correlation vanishes in reasoning-dependent domains: for math questions, ROC AUC is 0.55. More fundamentally, we find that how well the entropy measure works depends on the amount of reasoning a question requires. Thus, entropy, which relates to data uncertainty, should be integrated into uncertainty estimation frameworks, while MASJ requires refinement. Moreover, existing MMLU-Pro samples are biased; the amount of reasoning required should be balanced across subdomains to provide a fairer assessment of LLM performance.
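
To make the evaluation setup concrete, the following is a minimal sketch of how token-wise entropy could be scored as an error predictor with ROC AUC. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice to average entropy over answer tokens, and the toy probability distributions are all hypothetical and do not come from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def response_entropy(token_distributions):
    """Mean token-wise entropy over the answer tokens (aggregation is an assumption)."""
    total = sum(token_entropy(d) for d in token_distributions)
    return total / len(token_distributions)

def roc_auc(scores, labels):
    """Probability that a random error (label 1) gets a higher score than a
    random correct answer (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (made up): each response is one answer token over options A-D.
responses = [
    [[0.97, 0.01, 0.01, 0.01]],  # confident
    [[0.25, 0.25, 0.25, 0.25]],  # maximally uncertain
    [[0.70, 0.10, 0.10, 0.10]],  # fairly confident
    [[0.40, 0.30, 0.20, 0.10]],  # uncertain
]
labels = [0, 1, 0, 1]  # 1 = the model's answer was wrong (hypothetical)

scores = [response_entropy(r) for r in responses]
print("entropies:", [round(s, 3) for s in scores])
print("ROC AUC:", roc_auc(scores, labels))
```

A ROC AUC near 0.5 means the score ranks errors no better than chance, which is how the 0.55 reported for math questions should be read; values closer to 1.0, such as the 0.73 reported for biology, mean that higher entropy tends to accompany wrong answers.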
