Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
July 1, 2024
Authors: Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu
cs.AI
Abstract
LLMs and RAG systems are now capable of handling millions of input tokens or
more. However, evaluating the output quality of such systems on long-context
tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity.
In this work, we argue that summarization can play a central role in such
evaluation. We design a procedure to synthesize Haystacks of documents,
ensuring that specific insights repeat across documents. The "Summary
of a Haystack" (SummHay) task then requires a system to process the Haystack
and generate, given a query, a summary that identifies the relevant insights
and precisely cites the source documents. Since we have precise knowledge of
what insights should appear in a haystack summary and what documents should be
cited, we implement a highly reproducible automatic evaluation that can score
summaries on two aspects - Coverage and Citation. We generate Haystacks in two
domains (conversation, news), and perform a large-scale evaluation of 10 LLMs
and 50 corresponding RAG systems. Our findings indicate that SummHay is an open
challenge for current systems: even systems provided with an Oracle signal
of document relevance lag our estimate of human performance (56%) by 10+
points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and
Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to
study enterprise RAG systems and position bias in long-context models. We hope
future systems can equal and surpass human performance on SummHay.
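The Coverage and Citation scoring described above can be sketched as follows. Since each Haystack is synthesized with known reference insights and known source documents per insight, a generated summary can be scored by (a) the fraction of reference insights it covers and (b) how well its citations for a matched insight agree with the gold source documents. This is a minimal illustrative sketch, not the paper's exact implementation: the function names, the use of F1 for citation quality, and the input representations are assumptions.

```python
def coverage_score(matched_insights: set, reference_insights: set) -> float:
    """Fraction of the Haystack's reference insights that the summary mentions
    (assumes insight matching has already been done, e.g. by an LLM judge)."""
    if not reference_insights:
        return 0.0
    return len(matched_insights & reference_insights) / len(reference_insights)


def citation_score(cited_docs: set, gold_docs: set) -> float:
    """F1 between the documents a summary cites for an insight and the gold
    source documents known from Haystack synthesis (F1 is an assumption here)."""
    if not cited_docs or not gold_docs:
        return 0.0
    overlap = len(cited_docs & gold_docs)
    precision = overlap / len(cited_docs)
    recall = overlap / len(gold_docs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 4 reference insights covered; for one matched
# insight the summary cites docs {1, 2, 3} while the gold sources are {2, 3, 4}.
cov = coverage_score({"i1", "i2"}, {"i1", "i2", "i3", "i4"})  # 0.5
cit = citation_score({1, 2, 3}, {2, 3, 4})  # 2/3
```

A per-insight citation score like this would then be averaged over matched insights and combined with coverage into a joint number; the exact combination used for the paper's Joint Score is not specified in the abstract.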