NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Abstract

When evaluating the long-context capabilities of large language models (LLMs), identifying content relevant to a user's query within the original long document is a crucial prerequisite for any LLM to answer questions based on long text. We present NeedleBench, a framework consisting of a series of progressively more challenging tasks for assessing bilingual long-context capabilities. The tasks span multiple length intervals (4k, 8k, 32k, 128k, 200k, 1000k, and beyond) and different depth ranges, so that critical data points can be strategically inserted in different text depth zones to rigorously test the retrieval and reasoning capabilities of models in diverse contexts. We use the NeedleBench framework to assess how well the leading open-source models can identify key information relevant to the question and apply that information to reasoning in bilingual long texts. Furthermore, we propose the Ancestral Trace Challenge (ATC) to mimic the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks, providing a simple method for evaluating LLMs in dealing with complex long-context situations. Our results suggest that current LLMs have significant room for improvement in practical long-context applications, as they struggle with logical reasoning of this complexity. All code and resources are available at OpenCompass: https://github.com/open-compass/opencompass.
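
The idea of inserting critical data points at different depths across several context lengths can be illustrated with a minimal sketch. This is not the NeedleBench or OpenCompass implementation: the function names `insert_needle` and `build_test_cases`, the use of plain lists of strings as "tokens", and the percentage-based depth are all assumptions made for illustration only.

```python
from typing import Dict, List


def insert_needle(haystack: List[str], needle: List[str], depth_percent: int) -> List[str]:
    """Insert `needle` into `haystack` at a relative depth (0 = start, 100 = end).

    Hypothetical helper: a real benchmark would typically count model tokens,
    not list items, and might align insertions to sentence boundaries.
    """
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    position = len(haystack) * depth_percent // 100
    return haystack[:position] + needle + haystack[position:]


def build_test_cases(
    filler: List[str],
    needle: List[str],
    context_lengths: List[int],
    depth_percents: List[int],
) -> List[Dict]:
    """Build one test case per (context length, depth) pair.

    Each case has exactly `length` items: filler text truncated to leave
    room for the needle, with the needle placed at the requested depth.
    """
    cases = []
    for length in context_lengths:
        budget = length - len(needle)
        if budget < 0 or budget > len(filler):
            raise ValueError(f"cannot build a context of length {length}")
        haystack = filler[:budget]
        for depth in depth_percents:
            cases.append(
                {
                    "length": length,
                    "depth_percent": depth,
                    "tokens": insert_needle(haystack, needle, depth),
                }
            )
    return cases


if __name__ == "__main__":
    filler_text = [f"w{i}" for i in range(100)]   # stand-in for long filler text
    needle_text = ["SECRET", "FACT"]              # stand-in for the key information
    for case in build_test_cases(filler_text, needle_text, [10, 50], [0, 50, 100]):
        print(case["length"], case["depth_percent"], case["tokens"].index("SECRET"))
```

Sweeping both the context length and the depth in this way gives a grid of test cases, so a model's retrieval accuracy can be reported per length and per depth zone rather than as a single number.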
