PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Abstract

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks: because the prequels are not part of the original story, assessing their plausibility typically requires searching for and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, lag behind humans by more than 15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
