Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
January 16, 2025
Authors: Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego
cs.AI
Abstract
One of the most widely used methods to evaluate LLMs is Multiple Choice
Question (MCQ) tests. MCQ benchmarks enable testing LLM knowledge on almost
any topic at scale, since the results can be processed automatically. To
help the LLM answer, a few examples, known as few-shot examples, can be included in the
prompt. Moreover, the LLM can be asked to answer the question directly with the
selected option or to first provide the reasoning and then the selected answer,
which is known as chain of thought. In addition to checking whether the
selected answer is correct, the evaluation can look at the LLM-estimated
probability of its response as an indication of the confidence of the LLM in
the response. In this paper, we study how the LLM's confidence in its answer
depends on whether the model has been asked to answer directly or to provide
its reasoning before answering. Evaluating questions on a wide range of topics
across seven different models shows that LLMs are more confident in their
answers when they provide reasoning before the answer. This
occurs regardless of whether the selected answer is correct. Our hypothesis is
that this behavior arises because the reasoning modifies the probability of the
selected answer: the LLM predicts the answer based on the input question and
the reasoning that supports the selection it has made. Therefore, LLM-estimated
probabilities seem to have intrinsic limitations that should be understood in
order to use them in evaluation procedures. Interestingly, the same behavior
has been observed in humans, for whom explaining an answer increases confidence
in its correctness.
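A minimal sketch (not the authors' code) of the kind of confidence measure the abstract refers to: the probability the model assigns to the chosen option letter at the answer position. It assumes a Hugging Face causal language model; the model name, prompt format, and helper function below are illustrative placeholders, and for chain of thought the generated reasoning would be included in the prompt before "Answer:".

```python
# Sketch: read an LLM's confidence in an MCQ answer as the (renormalized)
# probability it assigns to the chosen option letter as the next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates seven different models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_confidence(prompt: str, options=("A", "B", "C", "D")):
    """Return the selected option and the probability assigned to it.

    The prompt should end right before the answer letter, e.g. after
    "Answer:" for direct answering, or after the model's reasoning and
    "Answer:" when using chain of thought.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    # Probability mass placed on each option letter as the next token.
    option_ids = [tokenizer.encode(" " + o, add_special_tokens=False)[0]
                  for o in options]
    option_probs = probs[option_ids]
    option_probs = option_probs / option_probs.sum()  # renormalize over options
    best = int(option_probs.argmax())
    return options[best], float(option_probs[best])

choice, confidence = option_confidence(
    "Question: Which planet is known as the Red Planet?\n"
    "A) Venus\nB) Mars\nC) Jupiter\nD) Saturn\nAnswer:"
)
print(choice, confidence)
```

Comparing the value returned for a direct-answer prompt with the value for a prompt that already contains the model's reasoning is one way to reproduce, at small scale, the confidence gap the paper reports.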