Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems

Abstract

LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific insights repeat across documents. The "Summary of a Haystack" (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects: Coverage and Citation. We generate Haystacks in two domains (conversation, news), and perform a large-scale evaluation of 10 LLMs and 50 corresponding RAG systems. Our findings indicate that SummHay is an open challenge for current systems, as even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay.
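To make the two evaluation aspects concrete, the following is a minimal Python sketch of how per-insight Coverage and Citation judgments might be aggregated into a Joint Score. The abstract does not give the scoring formulas, so everything here is an assumption made for illustration: the `InsightResult` record, the 1.0 / 0.5 / 0.0 coverage scale, the use of F1 over cited versus gold documents, and the product-based joint score are all hypothetical choices, not the benchmark's confirmed definitions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InsightResult:
    """Hypothetical record holding the judgment for one reference insight."""
    coverage: float        # assumed scale: 1.0 covered, 0.5 partial, 0.0 missing
    cited_docs: set[str]   # documents the summary cites for this insight
    gold_docs: set[str]    # documents known to contain this insight


def citation_f1(cited: set[str], gold: set[str]) -> float:
    """F1 between cited and gold document sets (an assumed citation metric)."""
    overlap = len(cited & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cited)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_summary(results: list[InsightResult]) -> dict[str, float]:
    """Aggregate per-insight judgments into scores on a 0-100 scale."""
    if not results:
        return {"coverage": 0.0, "citation": 0.0, "joint": 0.0}
    n = len(results)
    # Coverage: mean coverage over all reference insights.
    coverage = sum(r.coverage for r in results) / n
    # Citation: mean F1 over insights that are at least partially covered.
    covered = [r for r in results if r.coverage > 0]
    citation = (
        sum(citation_f1(r.cited_docs, r.gold_docs) for r in covered) / len(covered)
        if covered
        else 0.0
    )
    # Joint (assumed): per-insight product of coverage and citation F1, averaged.
    joint = sum(
        r.coverage * citation_f1(r.cited_docs, r.gold_docs) for r in results
    ) / n
    return {
        "coverage": 100 * coverage,
        "citation": 100 * citation,
        "joint": 100 * joint,
    }
```

Under these assumptions, a summary earns a high joint score only when it both covers an insight and cites the right documents for it, which is consistent with the abstract's framing of the Joint Score as combining the two aspects.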
