BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
June 14, 2024
作者: Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
cs.AI
Abstract
In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess how efficiently models handle long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively use only 10-20% of the context, and their performance declines sharply as reasoning complexity increases. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, recurrent memory transformers demonstrate the highest performance, processing sequences of up to 11 million tokens. The BABILong benchmark is extendable to arbitrary lengths to support the evaluation of upcoming models with increased capabilities, and we provide splits of up to 1 million tokens.