Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
June 25, 2024
Authors: Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li
cs.AI
Abstract
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-long context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong's test cases each document is relevant to the final answer; ignoring any document will lead to the failure of the answer. Furthermore, Loong introduces four types of tasks spanning a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Extensive experiments indicate that existing long-context language models still exhibit considerable potential for enhancement. Retrieval-augmented generation (RAG) achieves poor performance, demonstrating that Loong can reliably assess the model's long-context modeling capabilities.