Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
June 25, 2024
Authors: Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li
cs.AI
Abstract
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-long context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose Loong, a novel long-context benchmark that aligns with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong's test cases every document is relevant to the final answer; ignoring any document leads to an incorrect answer. Furthermore, Loong introduces four task types spanning a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Extensive experiments indicate that existing long-context language models still have considerable room for improvement. Retrieval-augmented generation (RAG) performs poorly on Loong, demonstrating that Loong reliably assesses a model's long-context modeling capabilities.
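To make the extended multi-doc QA setting concrete, the sketch below shows what a test instance and a naive scoring loop could look like. It is a minimal illustration under assumed conventions: the field names (documents, question, answer, task), the prompt layout, and the exact-match scoring are hypothetical and are not the benchmark's actual data format or official evaluation protocol.

```python
from typing import Callable, List
from dataclasses import dataclass


@dataclass
class MultiDocQAInstance:
    """Hypothetical extended multi-doc QA test case: every document is evidence,
    and dropping any one of them should change the answer (no filler noise)."""
    documents: List[str]   # all documents contribute to the final answer
    question: str          # question that requires aggregating across documents
    answer: str            # gold answer
    task: str              # e.g. "spotlight_locating", "comparison",
                           # "clustering", or "chain_of_reasoning" (assumed labels)


def build_prompt(inst: MultiDocQAInstance) -> str:
    """Concatenate all documents and the question into one long-context prompt."""
    docs = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(inst.documents)
    )
    return f"{docs}\n\nQuestion: {inst.question}\nAnswer:"


def evaluate(instances: List[MultiDocQAInstance],
             generate: Callable[[str], str]) -> float:
    """Score a model callable `generate(prompt) -> str` with naive exact match."""
    if not instances:
        return 0.0
    correct = 0
    for inst in instances:
        prediction = generate(build_prompt(inst))
        correct += int(prediction.strip().lower() == inst.answer.strip().lower())
    return correct / len(instances)
```

The point the sketch tries to capture is the paper's central design choice: unlike noise-padded benchmarks, every entry in `documents` is load-bearing, so a model (or a RAG pipeline that retrieves only a subset of documents) cannot answer correctly without attending to the full context.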