Long-context LLMs Struggle with Long In-context Learning
April 2, 2024
Authors: Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen
cs.AI
Abstract
Large Language Models (LLMs) have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LIConBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with label ranges spanning 28 to 174 classes and input (few-shot demonstration) lengths ranging from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input in order to recognize the massive label space and make correct predictions. We evaluate 13 long-context LLMs on our benchmark. We find that long-context LLMs perform relatively well below a token length of 20K and benefit from utilizing the long context window. However, once the context exceeds 20K tokens, the performance of most LLMs, with the exception of GPT-4, drops dramatically. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis reveals a tendency among models to favor predicting labels presented toward the end of the sequence; their ability to reason over multiple pieces of information in a long sequence remains to be improved. Our study shows that long-context understanding and reasoning is still a challenging task for existing LLMs. We believe LIConBench could serve as a more realistic evaluation for future long-context LLMs.
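The setup described in the abstract (many few-shot demonstrations concatenated into a long context, followed by a query whose label must be chosen from a large label space) can be illustrated with a minimal sketch. This is not the authors' released code: the prompt template, the `count_tokens` helper, and the token budgets are assumptions for illustration only.

```python
# Minimal sketch (assumed, not the LIConBench implementation) of assembling a
# long in-context learning prompt for extreme-label classification: few-shot
# demonstrations are appended until a target token budget (e.g. 2K-50K) is
# reached, then the query is added and the model must pick one of many labels.
from typing import Callable


def build_long_icl_prompt(
    demos: list[tuple[str, str]],      # (text, label) demonstration pairs
    query: str,                        # test input to classify
    label_space: list[str],            # e.g. 28-174 candidate classes
    token_budget: int,                 # e.g. 20_000 or 50_000
    count_tokens: Callable[[str], int],
) -> str:
    """Concatenate demonstrations up to a token budget, then append the query."""
    header = "Classify the text into one of: " + ", ".join(label_space) + "\n\n"
    parts = [header]
    used = count_tokens(header)
    for text, label in demos:
        block = f"Text: {text}\nLabel: {label}\n\n"
        cost = count_tokens(block)
        if used + cost > token_budget:
            break
        parts.append(block)
        used += cost
    parts.append(f"Text: {query}\nLabel:")
    return "".join(parts)


# Example usage with a whitespace token counter as a stand-in for a real tokenizer.
if __name__ == "__main__":
    demos = [("I loved this movie", "positive"), ("Terrible plot", "negative")]
    prompt = build_long_icl_prompt(
        demos,
        query="An unforgettable film",
        label_space=["positive", "negative"],
        token_budget=100,
        count_tokens=lambda s: len(s.split()),
    )
    print(prompt)
```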