

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

November 7, 2024
Authors: Jonathan Roberts, Kai Han, Samuel Albanie
cs.AI

Abstract

As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably thread-safe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.
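To make the thread-following setup concrete, here is a minimal sketch of one plausible way to construct such a task: a synthetic key-value haystack hiding a chain of pairs in which each value names the next key to look up, so the model must follow the chain from a given starting key to a final value. This is an illustration in the spirit of the experiments described above, not the paper's exact task format; the function name and parameters are ours.

```python
import json
import random
import uuid

def build_threaded_haystack(n_pairs: int, thread_len: int, seed: int = 0):
    """Build a synthetic key-value haystack containing one hidden 'thread'.

    A thread is a chain of pairs in which each value is itself a key
    appearing elsewhere in the haystack. Task: starting from the first
    key of the chain, follow it and report the final value.
    """
    rng = random.Random(seed)
    keys = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(n_pairs)]
    values = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(n_pairs)]
    haystack = dict(zip(keys, values))

    # Overwrite a random subset of entries so they chain into a thread:
    # the value of each chosen pair becomes the key of the next one.
    thread_keys = rng.sample(keys, thread_len)
    for src, dst in zip(thread_keys, thread_keys[1:]):
        haystack[src] = dst
    final_value = str(uuid.UUID(int=rng.getrandbits(128)))
    haystack[thread_keys[-1]] = final_value

    # Shuffle pair order so the thread is scattered through the context.
    items = list(haystack.items())
    rng.shuffle(items)
    context = json.dumps(dict(items), indent=1)
    prompt = (
        f"{context}\n\nStarting from key {thread_keys[0]}, follow the chain "
        "of values used as keys and return the final value."
    )
    return prompt, final_value

prompt, answer = build_threaded_haystack(n_pairs=500, thread_len=5)
```

Evaluation then amounts to sending `prompt` to a model and checking whether its response contains `answer`; varying `n_pairs` stretches the haystack toward a model's supported context length, while varying `thread_len` (or planting several independent chains) probes single- versus multi-thread following.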
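The abstract's caveat about tokenizers can be seen directly with a few lines of code. The sketch below (ours, not from the paper's released code) counts tokens for the same text under three tiktoken encodings; the characters-per-token ratios differ, which is why context lengths quoted in tokens are not directly comparable across model families.

```python
# Count tokens for identical text under different tokenizers.
# Requires: pip install tiktoken
import tiktoken

text = ("In many real-world tasks, decisions depend on details scattered "
        "across collections of often disparate documents. ") * 200

for name in ["gpt2", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(text))
    print(f"{name:12s} {n_tokens:6d} tokens "
          f"({len(text) / n_tokens:.2f} chars/token)")
```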