Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
July 1, 2024
Authors: Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu
cs.AI
Abstract
LLMs and RAG systems are now capable of handling millions of input tokens or
more. However, evaluating the output quality of such systems on long-context
tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity.
In this work, we argue that summarization can play a central role in such
evaluation. We design a procedure to synthesize Haystacks of documents,
ensuring that specific insights repeat across documents. The "Summary
of a Haystack" (SummHay) task then requires a system to process the Haystack
and generate, given a query, a summary that identifies the relevant insights
and precisely cites the source documents. Since we have precise knowledge of
what insights should appear in a haystack summary and what documents should be
cited, we implement a highly reproducible automatic evaluation that can score
summaries on two aspects: Coverage and Citation. We generate Haystacks in two
domains (conversation, news), and perform a large-scale evaluation of 10 LLMs
and 50 corresponding RAG systems. Our findings indicate that SummHay is an open
challenge for current systems, as even systems provided with an Oracle signal
of document relevance lag our estimate of human performance (56%) by 10+
points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and
Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to
study enterprise RAG systems and position bias in long-context models. We hope
future systems can equal and surpass human performance on SummHay.
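The Coverage/Citation scoring described in the abstract can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the actual SummHay evaluation uses an LLM judge to match summary bullets against reference insights, and the exact way Coverage and Citation combine into the Joint Score is an assumption here (per-insight citation F1, averaged over all reference insights).

```python
# Simplified sketch of SummHay-style scoring. Function names, inputs, and
# the Joint Score formula are illustrative assumptions, not the paper's API.

def coverage_score(matched_insights, reference_insights):
    """Coverage: fraction of reference insights the summary covers."""
    return len(matched_insights) / len(reference_insights)

def citation_f1(cited_docs, gold_docs):
    """Citation: F1 between documents cited for one insight and the
    documents that actually contain that insight."""
    cited, gold = set(cited_docs), set(gold_docs)
    if not cited or not gold:
        return 0.0
    overlap = len(cited & gold)
    precision = overlap / len(cited)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def joint_score(matches, reference_insights):
    """Joint Score (assumed form): average over all reference insights of
    the citation F1 for covered insights, 0 for missed ones."""
    total = 0.0
    for insight_id, gold_docs in reference_insights.items():
        if insight_id in matches:
            total += citation_f1(matches[insight_id], gold_docs)
    return total / len(reference_insights)

# Toy example: two reference insights; the summary covers only "i1"
# and cites one of its two source documents.
reference = {"i1": ["d1", "d2"], "i2": ["d3"]}
matches = {"i1": ["d1"]}
print(coverage_score(matches, reference))   # 0.5
print(citation_f1(["d1"], ["d1", "d2"]))    # ~0.667
print(joint_score(matches, reference))      # ~0.333
```

Because a covered insight with wrong citations contributes nothing under this combination, a system must both find the insight and attribute it correctly, which mirrors why Oracle-retrieval systems in the paper still trail human performance.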