NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
July 16, 2024
Authors: Mo Li, Songyang Zhang, Yunxin Liu, Kai Chen
cs.AI
Abstract
In evaluating the long-context capabilities of large language models (LLMs),
identifying content relevant to a user's query from original long documents is
a crucial prerequisite for any LLM to answer questions based on long text. We
present NeedleBench, a framework consisting of a series of progressively more
challenging tasks for assessing bilingual long-context capabilities, spanning
multiple length intervals (4k, 8k, 32k, 128k, 200k, 1000k, and beyond) and
different depth ranges, allowing the strategic insertion of critical data
points in different text depth zones to rigorously test the retrieval and
reasoning capabilities of models in diverse contexts. We use the NeedleBench
framework to assess how well the leading open-source models can identify key
information relevant to the question and apply that information to reasoning in
bilingual long texts. Furthermore, we propose the Ancestral Trace Challenge
(ATC) to mimic the complexity of logical reasoning challenges that are likely
to be present in real-world long-context tasks, providing a simple method for
evaluating LLMs in dealing with complex long-context situations. Our results
suggest that current LLMs have significant room for improvement in practical
long-context applications, as they struggle with precisely these kinds of
logical reasoning challenges. All code and resources are available at
OpenCompass:
https://github.com/open-compass/opencompass.
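To make the evaluation setup concrete, the following is a minimal sketch of the needle-in-a-haystack construction that NeedleBench generalizes: a critical fact (the "needle") is inserted at a chosen relative depth of a long distractor text, and the model is asked a question only that fact can answer. The function names (insert_needle, build_sample) and the sentence-boundary heuristic are illustrative assumptions, not the OpenCompass API.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth of `haystack` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    pos = int(len(haystack) * depth)
    # Snap back to the nearest sentence boundary so the needle reads naturally.
    cut = haystack.rfind(". ", 0, pos)
    if cut != -1:
        pos = cut + 2
    return haystack[:pos] + needle + " " + haystack[pos:]

def build_sample(haystack: str, needle: str, depth: float, question: str) -> str:
    """Assemble one retrieval prompt: context with an inserted needle, then a question."""
    return f"{insert_needle(haystack, needle, depth)}\n\nQuestion: {question}\nAnswer:"

if __name__ == "__main__":
    filler = "The sky was clear over the harbor that morning. " * 2000
    needle = "The secret code for the vault is 7142."
    prompt = build_sample(filler, needle, 0.25,
                          "What is the secret code for the vault?")
    print(prompt[:200])
```

Sweeping `depth` over a grid (and `len(haystack)` over the length intervals listed above) yields the per-depth, per-length scores that such benchmarks report.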
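Likewise, a toy version of an Ancestral Trace Challenge-style item, assuming chained parent-child statements in which every statement matters and the answer requires reassembling the full chain; build_atc_sample and the exact wording are hypothetical, not the paper's generator.

```python
import random

def build_atc_sample(names: list[str], seed: int = 0) -> tuple[str, str]:
    """Return (prompt, answer): shuffled parent-child statements plus a question
    whose answer is the root of the chain."""
    statements = [f"{names[i]} is the parent of {names[i + 1]}."
                  for i in range(len(names) - 1)]
    random.Random(seed).shuffle(statements)  # chain order is hidden from the model
    question = f"Question: Who is the earliest ancestor of {names[-1]}?"
    return " ".join(statements) + "\n" + question, names[0]

if __name__ == "__main__":
    prompt, answer = build_atc_sample(["Ada", "Ben", "Cora", "Dan"])
    print(prompt)
    print("Expected answer:", answer)
```

Lengthening the name chain scales the number of reasoning steps without changing the task format, which is what makes this a simple probe of multi-step reasoning over long contexts.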