BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Abstract

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace and fail to comprehensively assess how efficiently models handle long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and that their performance declines sharply as reasoning complexity increases. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, recurrent memory transformers demonstrate the highest performance and can process lengths of up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of upcoming models with increased capabilities, and we provide splits with lengths of up to 1 million tokens.
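To make the "reasoning-in-a-haystack" idea concrete, the following is a minimal sketch of how a sample of this kind could be assembled: a few task facts are scattered, in their original order, among unrelated filler sentences until a length budget is reached. This is an illustration under stated assumptions, not the authors' generation pipeline. The function name `build_haystack_sample`, the filler sentences, and the use of a word count as a stand-in for a token count are all hypothetical.

```python
import random


def build_haystack_sample(facts, question, filler_sentences, target_words, seed=0):
    """Scatter `facts` (order preserved) among filler sentences.

    Hypothetical helper for illustration only; length is measured in
    whitespace-separated words as a rough proxy for tokens.
    """
    rng = random.Random(seed)

    # Draw filler sentences until the word budget is reached.
    filler = []
    n_words = sum(len(fact.split()) for fact in facts)
    while n_words < target_words:
        sentence = rng.choice(filler_sentences)
        filler.append(sentence)
        n_words += len(sentence.split())

    # Pick one insertion slot per fact; sorting keeps the facts in order.
    positions = sorted(rng.randint(0, len(filler)) for _ in facts)

    # Insert from the last fact backwards so earlier slots stay valid.
    document = list(filler)
    for pos, fact in reversed(list(zip(positions, facts))):
        document.insert(pos, fact)

    return {"input": " ".join(document), "question": question}


# Example usage with made-up facts and filler text.
facts = [
    "Mary went to the kitchen.",
    "Mary picked up the apple.",
    "Mary travelled to the garden.",
]
filler_sentences = [
    "The weather had been mild all week.",
    "A clock ticked somewhere in the old house.",
    "Nobody remembered who had planted the oak tree.",
]
sample = build_haystack_sample(
    facts, "Where is the apple?", filler_sentences, target_words=200
)
print(len(sample["input"].split()), "words")
```

Answering the question requires finding all three facts and chaining them, which is what separates this setup from retrieving a single isolated sentence. Raising `target_words` lengthens the haystack without changing the underlying reasoning task.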

