

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

November 7, 2024
作者: Jonathan Roberts, Kai Han, Samuel Albanie
cs.AI

Abstract

As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.
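To make the "following threads of information" setup concrete, below is a minimal sketch of how such a task can be constructed: a haystack of random key-value pairs hiding a chain of entries that must be followed hop by hop. All names and the exact record format are illustrative assumptions, not the authors' released code.

```python
import random
import uuid


def new_id() -> str:
    """Short random identifier used for keys and values."""
    return uuid.uuid4().hex[:8]


def make_threaded_haystack(num_pairs: int, hops: int, seed: int = 0):
    """Build a synthetic key-value haystack hiding one multi-hop thread.

    Most entries are random distractors. Along the thread, the value of
    hop i is the key of hop i+1, so answering the question requires
    following `hops` chained lookups through the context.
    Illustrative sketch only; not the paper's released code.
    """
    rng = random.Random(seed)

    keys = [new_id() for _ in range(num_pairs)]
    pairs = {k: new_id() for k in keys}

    # Rewire a random subset of entries into a chain: key_0 -> key_1 -> ...
    thread = rng.sample(keys, hops)
    for here, there in zip(thread, thread[1:]):
        pairs[here] = there  # this hop's value is the next hop's key

    lines = [f"{k}: {v}" for k, v in pairs.items()]
    rng.shuffle(lines)  # scatter the thread through the haystack

    question = (
        f"Start at key {thread[0]} and repeatedly look up each value "
        f"as the next key. What is the value at the end of the chain?"
    )
    return "\n".join(lines), question, pairs[thread[-1]]


context, question, answer = make_threaded_haystack(num_pairs=5000, hops=5)
```

Scaling `num_pairs` grows the haystack toward the supported context length, while planting several independent chains in the same haystack probes the multi-thread ("thread-safe") setting the abstract describes.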
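The tokenizer caveat is easy to verify directly. The sketch below, assuming `tiktoken` and Hugging Face `transformers` are installed (the two encodings chosen are illustrative), counts how many tokens different tokenizers assign to the same text:

```python
# pip install tiktoken transformers
import tiktoken
from transformers import AutoTokenizer

text = "Long-context limits are quoted in tokens, not characters. " * 200

cl100k = tiktoken.get_encoding("cl100k_base")  # GPT-4-era OpenAI encoding
gpt2 = AutoTokenizer.from_pretrained("gpt2")   # ungated reference tokenizer

n_cl100k = len(cl100k.encode(text))
n_gpt2 = len(gpt2.encode(text, add_special_tokens=False))

print(f"{len(text)} characters")
print(f"cl100k_base: {n_cl100k} tokens ({len(text) / n_cl100k:.2f} chars/token)")
print(f"gpt2:        {n_gpt2} tokens ({len(text) / n_gpt2:.2f} chars/token)")
```

Because characters-per-token ratios differ across tokenizers, a nominal context limit of, say, 128K tokens corresponds to substantially different amounts of written text from one model family to another.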