

Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers

January 9, 2024
Authors: Gal Yona, Roee Aharoni, Mor Geva
cs.AI

Abstract

Factual questions typically can be answered correctly at different levels of granularity. For example, both "August 4, 1961" and "1961" are correct answers to the question "When was Barack Obama born?". Standard question answering (QA) evaluation protocols, however, do not explicitly take this into account and compare a predicted answer against answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the EntityQuestions dataset. We evaluate a range of decoding methods on GRANOLA-EQ, including a new algorithm, called Decoding with Response Aggregation (DRAG), that is geared towards aligning the response granularity with the model's uncertainty. Our experiments show that large language models with standard decoding tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields a nearly 20-point increase in accuracy on average, which further increases for rare entities. Overall, this reveals that standard evaluation and decoding schemes may significantly underestimate the knowledge encapsulated in LMs.
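To make the multi-granularity evaluation idea concrete, the sketch below scores a predicted answer against gold answers ordered from most to least specific. This is a hypothetical illustration of the setting, not the paper's actual GRANOLA scoring function: the matching rule (exact string match) and the informativeness weighting are assumptions made for the example.

```python
def granola_eval(prediction, multi_granularity_answers):
    """Score a prediction against gold answers ordered from most
    specific to least specific (illustrative scheme, not the paper's).

    Returns (accuracy, informativeness): accuracy is 1.0 if the
    prediction matches any granularity level; informativeness decays
    as the matched level gets coarser.
    """
    pred = prediction.strip().lower()
    for rank, gold in enumerate(multi_granularity_answers):
        if pred == gold.strip().lower():
            # Most specific match scores 1.0; coarser matches score lower.
            informativeness = 1.0 - rank / len(multi_granularity_answers)
            return 1.0, informativeness
    return 0.0, 0.0

# Example from the abstract: both answers are correct, but the
# full date is more informative than the year alone.
answers = ["August 4, 1961", "1961"]
print(granola_eval("August 4, 1961", answers))  # (1.0, 1.0)
print(granola_eval("1961", answers))            # (1.0, 0.5)
print(granola_eval("1962", answers))            # (0.0, 0.0)
```

Under single-granularity evaluation, "1961" would be marked wrong against the gold answer "August 4, 1961"; here it is credited as correct but less informative, which is the gap the abstract describes.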