Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction

Abstract

Large language models are widely used in real-world applications, where they are often tasked with reasoning over large volumes of documents. A notable development in this space is models with extended context capabilities, some of which can process over 2 million tokens. However, the capabilities of such long-context models remain uncertain in production systems, which motivates the need to benchmark their performance on real-world use cases. We address this challenge by proposing SWiM, an evaluation framework that addresses the limitations of standard tests. Testing the framework on eight long-context models, we find that even strong models such as GPT-4 and Claude 3 Opus degrade in performance when the relevant information sits in the middle of the context window (the lost-in-the-middle effect). In addition to the benchmark, we propose medoid voting, a simple but effective training-free approach that helps alleviate this effect: the model generates a response several times, each time with the documents in the context randomly permuted, and the medoid answer is selected. We evaluate medoid voting on single-document QA tasks, achieving up to a 24% lift in accuracy.
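The medoid voting procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable standing in for a model call, and `text_distance` uses a simple string-similarity measure from the Python standard library as a placeholder, since the abstract does not say how the distance between answers is computed.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Sequence


def text_distance(a: str, b: str) -> float:
    """Placeholder dissimilarity in [0, 1] (assumption: any answer-level
    distance, e.g. one based on embeddings, could be swapped in here)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def medoid(responses: Sequence[str],
           distance: Callable[[str, str], float] = text_distance) -> str:
    """Return the response with the smallest total distance to all others."""
    if not responses:
        raise ValueError("need at least one response")
    totals = [sum(distance(r, other) for other in responses) for r in responses]
    return responses[totals.index(min(totals))]


def medoid_vote(question: str,
                documents: Sequence[str],
                generate: Callable[[str, List[str]], str],
                num_runs: int = 3,
                rng: Optional[random.Random] = None) -> str:
    """Query the model `num_runs` times, shuffling the document order each
    time, and return the medoid of the collected answers.

    `generate(question, documents)` is a hypothetical model wrapper.
    """
    if rng is None:
        rng = random.Random()
    responses = []
    for _ in range(num_runs):
        shuffled = list(documents)   # copy so the caller's list is untouched
        rng.shuffle(shuffled)        # new random document order for this run
        responses.append(generate(question, shuffled))
    return medoid(responses)
```

Because each run sees the documents in a different order, the relevant document is unlikely to land in the middle of the context every time; picking the medoid then favors the answer that most of the runs agree on over an outlier produced by an unlucky ordering.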

