LongIns: A Challenging Long-context Instruction-based Exam for LLMs
June 25, 2024
作者: Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Wenhu Chen, Ge Zhang
cs.AI
Abstract
The long-context capabilities of large language models (LLMs) have been a hot
topic in recent years. To evaluate the performance of LLMs in different
scenarios, various assessment benchmarks have emerged. However, because most of
these benchmarks focus on identifying key information to answer questions,
which mainly requires the retrieval ability of LLMs, they can only partially
represent the reasoning performance of LLMs over large amounts of information.
Meanwhile, although LLMs often claim to have context windows of 32k, 128k,
200k, or even longer, these benchmarks fail to reveal the actual supported
length of these LLMs. To address these issues, we propose the LongIns benchmark
dataset, a challenging long-context instruction-based exam for LLMs, built on
existing instruction datasets. Specifically, in LongIns we introduce three
evaluation settings: Global Instruction & Single Task (GIST), Local
Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks
(LIMT). Based on LongIns, we perform comprehensive evaluations of existing
LLMs and report the following important findings: (1) the top-performing GPT-4
with a 128k context length performs poorly when the evaluation context window
in our LongIns is 16k; (2) for the multi-hop reasoning ability of many existing
LLMs, significant effort is still needed even under short context windows
(less than 4k).
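The three settings differ in where the task instruction sits relative to the examples: once globally (GIST), repeated locally per example of one task (LIST), or locally per example with examples drawn from several tasks (LIMT). As a rough illustration of these layouts, here is a minimal sketch under our own assumptions, not the authors' implementation; all function and field names are hypothetical:

```python
# Hypothetical sketch of the three LongIns prompt layouts.
# The "Input:" formatting and function names are illustrative only.

def build_gist(instruction: str, examples: list[str]) -> str:
    """Global Instruction & Single Task: the task instruction appears
    once at the top, followed by every example of a single task."""
    body = "\n".join(f"Input: {x}" for x in examples)
    return f"{instruction}\n{body}"

def build_list(instruction: str, examples: list[str]) -> str:
    """Local Instruction & Single Task: the same task instruction is
    restated locally before each example."""
    return "\n".join(f"{instruction}\nInput: {x}" for x in examples)

def build_limt(pairs: list[tuple[str, str]]) -> str:
    """Local Instruction & Multiple Tasks: examples from several tasks
    are interleaved, each preceded by its own task instruction."""
    return "\n".join(f"{ins}\nInput: {x}" for ins, x in pairs)
```

Under this reading, context length grows with the number of examples while the model must keep track of which instruction governs which example, which is what shifts the burden from retrieval toward reasoning.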