

More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG

March 6, 2025
Authors: Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky
cs.AI

Abstract

Retrieval-augmented generation (RAG) provides LLMs with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen.
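The controlled setup described above can be sketched as follows. This is a minimal illustration of the idea of varying document count while holding total context length fixed; the function name, the character-based budget, and the padding strategy are assumptions for illustration, not the authors' implementation (see the linked repository for the actual data construction).

```python
def build_context(relevant, distractors, num_docs, budget_chars):
    """Assemble `num_docs` documents (relevant ones first) and
    truncate/pad each to an equal share so the concatenated context
    is exactly `budget_chars` long, regardless of `num_docs`."""
    docs = (relevant + distractors)[:num_docs]
    per_doc = budget_chars // num_docs
    pieces = []
    for d in docs:
        text = d[:per_doc]          # truncate long documents
        pieces.append(text.ljust(per_doc))  # pad short ones
    # distribute any integer-division remainder as trailing padding
    return "".join(pieces).ljust(budget_chars)

# Hypothetical example: same context length, different document counts.
relevant = ["Gold passage answering hop 1.", "Gold passage answering hop 2."]
distractors = ["Unrelated filler passage number %d." % i for i in range(20)]

ctx_few = build_context(relevant, distractors, num_docs=4, budget_chars=400)
ctx_many = build_context(relevant, distractors, num_docs=10, budget_chars=400)
assert len(ctx_few) == len(ctx_many) == 400
```

With a harness like this, any accuracy difference between the `num_docs=4` and `num_docs=10` conditions can be attributed to the number of documents rather than to context length, which is the isolation the paper argues for.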
