Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction

July 4, 2024
Authors: Amanda Dsouza, Christopher Glaze, Changho Shin, Frederic Sala
cs.AI

Abstract

Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents. An exciting development in this space is models with extended context capabilities, some accommodating over 2 million tokens. How well such long-context capabilities hold up in production systems remains uncertain, motivating the need to benchmark their performance on real-world use cases. We address this challenge by proposing SWiM, an evaluation framework that addresses the limitations of standard tests. Testing the framework on eight long-context models, we find that even strong models such as GPT-4 and Claude 3 Opus degrade in performance when information is located in the middle of the context window (the lost-in-the-middle effect). In addition to our benchmark, we propose medoid voting, a simple but effective training-free approach that helps alleviate this effect: the model generates a response several times, each time with the documents in the context randomly permuted, and the medoid answer is selected. We evaluate medoid voting on single-document QA tasks, achieving up to a 24% lift in accuracy.
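
To make the medoid-voting procedure concrete, here is a minimal sketch in Python based only on the description in the abstract. The caller-supplied `generate_answer(question, documents)` callable, the default trial count, and the use of simple string similarity as the distance between answers are illustrative assumptions, not the authors' implementation.

```python
import random
from difflib import SequenceMatcher


def medoid_vote(generate_answer, question, documents, num_trials=5, seed=0):
    """Medoid voting as described in the abstract: query the model several
    times, randomly permuting the documents in the context each time, then
    return the answer most similar to all the others (the medoid).

    `generate_answer(question, documents)` is assumed to call a long-context
    model with the documents in the given order and return an answer string.
    """
    rng = random.Random(seed)
    answers = []
    for _ in range(num_trials):
        shuffled = list(documents)
        rng.shuffle(shuffled)  # random permutation of the context documents
        answers.append(generate_answer(question, shuffled))

    # Pairwise string similarity as a stand-in distance between answers;
    # the paper's actual choice is not specified here (embedding similarity
    # would be a natural alternative).
    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()

    # The medoid is the answer with the highest total similarity to the rest.
    totals = [sum(similarity(a, b) for b in answers) for a in answers]
    return answers[max(range(num_trials), key=totals.__getitem__)]
```

Selecting the medoid (the answer closest to all the others) rather than taking a majority vote means free-form answers do not need to match exactly to be aggregated.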

