More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG

Abstract

Retrieval-augmented generation (RAG) provides large language models (LLMs) with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and the position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen.
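To make the controlled setup concrete, the following is a minimal sketch of one way to vary the document count while holding total length and the position of the relevant text fixed. It is an illustration under stated assumptions, not the released pipeline: the function names `build_context` and `total_words` are hypothetical, whitespace-separated word counts stand in for tokenizer counts, and distractors are simply truncated to an equal share of the remaining budget.

```python
from typing import List


def total_words(docs: List[str]) -> int:
    """Count whitespace-separated words across all documents (a stand-in for tokens)."""
    return sum(len(d.split()) for d in docs)


def build_context(relevant: List[str], distractors: List[str],
                  num_docs: int, budget: int) -> List[str]:
    """Return `num_docs` documents whose total word count equals `budget`.

    Relevant documents are kept whole and placed first, so their word
    offsets do not change as `num_docs` varies. The remaining budget is
    split as evenly as possible among truncated distractors.
    (Hypothetical helper; truncation is an assumption of this sketch.)
    """
    n_fill = num_docs - len(relevant)
    if n_fill < 0:
        raise ValueError("num_docs is smaller than the number of relevant documents")
    remaining = budget - total_words(relevant)
    if remaining < 0:
        raise ValueError("budget is smaller than the relevant documents")
    if n_fill == 0:
        if remaining != 0:
            raise ValueError("cannot reach the budget without distractors")
        return list(relevant)
    if len(distractors) < n_fill:
        raise ValueError("not enough distractor documents")

    base, extra = divmod(remaining, n_fill)
    if base == 0:
        raise ValueError("budget too small to give every distractor at least one word")

    docs = list(relevant)
    for i in range(n_fill):
        share = base + (1 if i < extra else 0)  # spread the remainder over the first docs
        words = distractors[i].split()
        if len(words) < share:
            raise ValueError("distractor too short for its share of the budget")
        docs.append(" ".join(words[:share]))
    return docs


if __name__ == "__main__":
    relevant = ["alpha beta gamma"]
    filler = " ".join(f"w{i}" for i in range(50))
    distractors = [filler] * 8
    for k in (2, 3, 5):
        ctx = build_context(relevant, distractors, num_docs=k, budget=43)
        print(k, len(ctx), total_words(ctx))
```

With a setup like this, any change in answer quality across values of `num_docs` cannot be attributed to a longer input or to the relevant text moving, which is the kind of isolation the abstract describes.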

