PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
August 13, 2025
Authors: Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou
cs.AI
Abstract
We introduce PRELUDE, a benchmark for evaluating long-context understanding
through the task of determining whether a character's prequel story is
consistent with the canonical narrative of the original book. Our task poses a
stronger demand for global comprehension and deep reasoning than existing
benchmarks -- as the prequels are not part of the original story, assessing
their plausibility typically requires searching and integrating information
that is only indirectly related. Empirically, 88% of instances require evidence
from multiple parts of the narrative. Experimental results highlight the
challenge of our task: in-context learning, RAG, and in-domain training with
state-of-the-art LLMs, as well as commercial DeepResearch services, all lag behind humans
by >15%. A further human study reveals that models often produce correct
answers with flawed reasoning, leading to a gap of more than 30% in reasoning accuracy
compared to humans. These findings underscore the substantial room for
improvement in long-context understanding and reasoning.
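To make the task format concrete, below is a minimal sketch of how a PRELUDE-style consistency-judgment instance and its accuracy evaluation might be structured. The field names, the `PreludeInstance` class, and the example instance are hypothetical illustrations, not the paper's released data schema.

```python
# Hypothetical sketch of a PRELUDE-style task instance and accuracy metric.
# Field names and the sample instance are illustrative assumptions, not the
# paper's actual data format.
from dataclasses import dataclass

@dataclass
class PreludeInstance:
    book: str        # title of the original (canonical) book
    character: str   # character whose prequel story is being judged
    prequel: str     # candidate prequel text, not part of the original book
    label: bool      # True if the prequel is consistent with the canon

def accuracy(predictions: list[bool], instances: list[PreludeInstance]) -> float:
    """Fraction of instances whose predicted consistency matches the gold label."""
    correct = sum(pred == inst.label for pred, inst in zip(predictions, instances))
    return correct / len(instances)

if __name__ == "__main__":
    data = [
        PreludeInstance(
            book="Journey to the West",
            character="Sha Wujing",
            prequel="Before the pilgrimage, he served as a celestial general...",
            label=True,
        ),
    ]
    print(accuracy([True], data))  # -> 1.0
```

As the abstract notes, answering well typically requires integrating indirectly related evidence from multiple parts of the book, so a system's predictions would in practice be produced by a long-context or retrieval-augmented pipeline rather than from the instance fields alone.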