LongIns: A Challenging Long-context Instruction-based Exam for LLMs
June 25, 2024
Authors: Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Wenhu Chen, Ge Zhang
cs.AI
Abstract
The long-context capabilities of large language models (LLMs) have been a hot
topic in recent years. To evaluate the performance of LLMs in different
scenarios, various assessment benchmarks have emerged. However, most of
these benchmarks focus on identifying key information to answer questions,
which mainly requires the retrieval ability of LLMs; they can therefore only
partially represent the reasoning performance of LLMs over large amounts of
information. Meanwhile, although LLMs often claim to support context windows of
32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual
supported length of these LLMs. To address these issues, we propose the LongIns
benchmark dataset, a challenging long-context instruction-based exam for LLMs
that is built on existing instruction datasets. Specifically, in our
LongIns, we introduce three evaluation settings: Global Instruction & Single
Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction &
Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations
on existing LLMs and obtain the following important findings: (1) the
top-performing GPT-4 with a 128k context length performs poorly at an
evaluation context window of 16k on our LongIns; (2) significant effort is
still needed to improve the multi-hop reasoning ability of many existing LLMs,
even under short context windows (less than 4k).
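To make the three evaluation settings concrete, the sketch below shows one plausible way such prompts could be assembled from an instruction dataset. The `Example` record, the function names, and the concatenation format are illustrative assumptions on our part, not the authors' actual construction pipeline; the sketch only reflects what the setting names imply about where instructions appear (globally once, or locally per instance) and whether tasks are mixed.

```python
# Minimal sketch of LongIns-style prompt construction, assuming a simple
# (instruction, input, answer) dataset format. All names and the exact
# concatenation scheme are hypothetical, for illustration only.
import random
from dataclasses import dataclass

@dataclass
class Example:
    instruction: str  # task instruction, e.g. "Classify the sentiment ..."
    input: str        # the instance the model must process
    answer: str       # gold answer, held out for scoring

def build_gist(examples: list[Example]) -> str:
    """Global Instruction & Single Task: the shared instruction appears
    once at the top; the model must carry it across the long context."""
    header = examples[0].instruction
    body = "\n".join(f"[{i}] {ex.input}" for i, ex in enumerate(examples))
    return f"{header}\n\n{body}"

def build_list(examples: list[Example]) -> str:
    """Local Instruction & Single Task: the same instruction is restated
    next to every instance."""
    return "\n".join(
        f"[{i}] {ex.instruction}\n{ex.input}" for i, ex in enumerate(examples)
    )

def build_limt(task_pool: list[list[Example]], n: int) -> str:
    """Local Instruction & Multiple Tasks: instances from different tasks
    are interleaved, each carrying its own local instruction."""
    mixed = random.sample([ex for task in task_pool for ex in task], n)
    return "\n".join(
        f"[{i}] {ex.instruction}\n{ex.input}" for i, ex in enumerate(mixed)
    )
```

Under this reading, context length is scaled by adding more instances to a single prompt, so every token contributes to the answer rather than serving as retrieval padding, which is what distinguishes the exam from needle-in-a-haystack-style benchmarks.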