Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction
July 4, 2024
Authors: Amanda Dsouza, Christopher Glaze, Changho Shin, Frederic Sala
cs.AI
Abstract
Large language models are prominently used in real-world applications, often
tasked with reasoning over large volumes of documents. An exciting development
in this space is models boasting extended context capabilities, with some
accommodating over 2 million tokens. The capabilities of such long-context
models in production systems remain uncertain, motivating the need to
benchmark their performance on real-world use cases. We address this
challenge by proposing
SWiM, an evaluation framework that addresses the limitations of standard tests.
Testing the framework on eight long context models, we find that even strong
models such as GPT-4 and Claude 3 Opus degrade in performance when information
is present in the middle of the context window (lost-in-the-middle effect).
Next, in addition to our benchmark, we propose medoid voting, a simple but
effective, training-free approach that helps alleviate this effect by
generating responses a few times, each time randomly permuting documents in the
context, and selecting the medoid answer. We evaluate medoid voting on single
document QA tasks, achieving up to a 24% lift in accuracy.
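As a rough illustration of the medoid-voting idea described in the abstract, the sketch below answers a question several times, shuffling the document order in the context on each attempt, and returns the answer closest to all the others. The helpers `call_model` (queries the LLM with a question and a context string) and `embed` (maps an answer to a vector), as well as the trial count and distance metric, are hypothetical placeholders, not the paper's actual implementation.

```python
import random
import numpy as np

def medoid_voting(question, documents, call_model, embed, n_trials=5, seed=0):
    """Generate several answers with randomly permuted document order,
    then return the medoid answer (minimum total distance to the others)."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n_trials):
        docs = documents[:]
        rng.shuffle(docs)                      # random permutation of the context
        context = "\n\n".join(docs)
        answers.append(call_model(question, context))

    # Embed each answer and compute pairwise Euclidean distances.
    vecs = np.stack([embed(a) for a in answers])
    dists = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=-1)

    # The medoid is the answer whose summed distance to all answers is smallest.
    medoid_idx = int(dists.sum(axis=1).argmin())
    return answers[medoid_idx]
```

Permuting the documents moves the relevant passage to different positions in the context window across trials, so answers that suffer from the lost-in-the-middle effect tend to be outliers, while the medoid favors the consistent response.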