

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

June 14, 2024
作者: Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
cs.AI

Abstract

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.
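The core construction described in the abstract is to hide the few facts needed for a bAbI-style task inside an arbitrarily long stretch of unrelated background text (the paper uses book text as filler), so the model must locate and chain the scattered facts before it can answer. The sketch below illustrates that idea only; it is not the authors' released code, and the function name, parameters, and sample sentences are hypothetical.

```python
# Minimal sketch (illustration, not the BABILong reference implementation):
# scatter the task-relevant facts of a bAbI-style sample at random positions
# inside long distractor text, producing a needle-in-a-haystack prompt whose
# answer requires finding and reasoning over the buried facts.
import random

def build_long_context_sample(facts: list[str],
                              question: str,
                              background_sentences: list[str],
                              context_len_sentences: int,
                              seed: int = 0) -> str:
    """Return a prompt whose required facts are hidden in filler text."""
    rng = random.Random(seed)
    # Fill the context with background sentences up to the target length.
    filler = [rng.choice(background_sentences)
              for _ in range(context_len_sentences)]
    # Pick random insertion points, keeping the facts in their original order.
    positions = sorted(rng.sample(range(len(filler) + 1), len(facts)))
    for offset, (pos, fact) in enumerate(zip(positions, facts)):
        filler.insert(pos + offset, fact)
    return " ".join(filler) + f"\nQuestion: {question}"

# Hypothetical usage with a single-supporting-fact (qa1-style) example.
sample = build_long_context_sample(
    facts=["Mary went to the kitchen.", "Mary picked up the apple."],
    question="Where is the apple?",
    background_sentences=["It was a quiet afternoon in the old town."],
    context_len_sentences=1000,
)
print(len(sample.split()), "words of context")
```

Because the amount of filler is a free parameter, samples can be generated at any context length, which is what lets the benchmark scale to the million-token splits mentioned above.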