PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

August 13, 2025
Authors: Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou
cs.AI

Abstract

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks: because the prequels are not part of the original story, assessing their plausibility typically requires searching for and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the difficulty of the task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, all lag behind humans by more than 15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy relative to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
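To make the RAG baseline mentioned above concrete, here is a minimal sketch of how such a pipeline might look, assuming a binary consistent/contradictory label. This is not the authors' implementation: the names (tokenize, retrieve, call_llm, judge_consistency) are illustrative, retrieval is simple lexical overlap rather than a learned retriever, and call_llm is a placeholder for whatever chat-completion API is available.

```python
# A rough sketch (not the authors' code) of a RAG-style baseline for PRELUDE:
# retrieve passages from the original book that may bear on a character's
# prequel, then ask an LLM whether the prequel is consistent with the
# canonical narrative.

from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenization with punctuation stripped."""
    return [w.lower().strip(".,;:!?\"'") for w in text.split()]


def overlap(query: Counter, passage: str) -> int:
    """Count of shared tokens (with multiplicity) between query and passage."""
    return sum((query & Counter(tokenize(passage))).values())


def retrieve(book_passages: list[str], prequel: str, k: int = 5) -> list[str]:
    """Return the k book passages most lexically similar to the prequel."""
    query = Counter(tokenize(prequel))
    return sorted(book_passages, key=lambda p: overlap(query, p), reverse=True)[:k]


def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real chat-completion call here.
    raise NotImplementedError


def judge_consistency(book_passages: list[str], prequel: str) -> str:
    """Ask the model for a binary consistency judgment over retrieved evidence."""
    evidence = "\n---\n".join(retrieve(book_passages, prequel))
    prompt = (
        "Excerpts from the original book:\n"
        f"{evidence}\n\n"
        "Is the following prequel story consistent with the canonical narrative?\n"
        "Answer 'consistent' or 'contradictory'.\n\n"
        f"Prequel:\n{prequel}"
    )
    return call_llm(prompt)
```

Note that because the relevant evidence is often only indirectly related to the prequel and scattered across the book (88% of instances require evidence from multiple parts of the narrative), a purely lexical retriever like the one sketched here is exactly the kind of baseline the paper reports lagging well behind humans.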