NoLiMa: Long-Context Evaluation Beyond Literal Matching
February 7, 2025
Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
cs.AI
Abstract
Recent large language models (LLMs) support long contexts ranging from 128K
to 1M tokens. A popular method for evaluating these capabilities is the
needle-in-a-haystack (NIAH) test, which involves retrieving a "needle"
(relevant information) from a "haystack" (long irrelevant context). Extensions
of this approach include increasing distractors, fact chaining, and in-context
reasoning. However, in these benchmarks, models can exploit existing literal
matches between the needle and haystack to simplify the task. To address this,
we introduce NoLiMa, a benchmark extending NIAH with a carefully designed
needle set, where questions and needles have minimal lexical overlap, requiring
models to infer latent associations to locate the needle within the haystack.
We evaluate 12 popular LLMs that claim to support contexts of at least 128K
tokens. While they perform well in short contexts (<1K), performance degrades
significantly as context length increases. At 32K, for instance, 10 models drop
below 50% of their strong short-length baselines. Even GPT-4o, one of the
top-performing exceptions, experiences a reduction from an almost-perfect
baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the
increased difficulty the attention mechanism faces in longer contexts when
literal matches are absent, making it harder to retrieve relevant information.
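
To make the setup concrete, below is a minimal Python sketch of a NIAH-style evaluation loop of the kind the abstract describes: filler paragraphs form the haystack, the needle is inserted at a chosen relative depth, and the model is scored on whether its answer contains the gold answer. The helper names (build_haystack, evaluate_once, query_model), the filler text, and the example needle/question pair are illustrative assumptions, not the paper's released code or data.

```python
# Illustrative sketch only: placeholder data and a dummy model, not the NoLiMa release.
from typing import Callable


def build_haystack(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    position = int(len(filler_paragraphs) * depth)
    parts = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
    return "\n\n".join(parts)


def evaluate_once(
    query_model: Callable[[str], str],  # wrapper around whatever chat/completion API is tested
    filler_paragraphs: list[str],
    needle: str,
    question: str,
    gold_answer: str,
    depth: float,
) -> bool:
    """Score one needle placement: does the model's answer contain the gold answer?"""
    context = build_haystack(filler_paragraphs, needle, depth)
    prompt = (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer briefly, using only the text above."
    )
    return gold_answer.lower() in query_model(prompt).lower()


if __name__ == "__main__":
    # A one-hop pair in the spirit of NoLiMa: the question and needle share almost no
    # words, so the model must know the Semperoper is in Dresden to connect them.
    needle = "Actually, Yuki lives next to the Semperoper."
    question = "Which character has been to Dresden?"
    filler = ["This is an irrelevant filler paragraph about something else."] * 200

    dummy_model = lambda prompt: "Yuki"  # stand-in; swap in a real API call in practice
    print(evaluate_once(dummy_model, filler, needle, question, gold_answer="Yuki", depth=0.5))
```

The example pair shows the benchmark's key design choice: the question mentions Dresden while the needle only mentions the Semperoper, so a model cannot locate the needle by literal string matching and must instead rely on latent associations.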