Long-context LLMs Struggle with Long In-context Learning
April 2, 2024
Authors: Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen
cs.AI
Abstract
Large Language Models (LLMs) have made significant strides in handling long
sequences exceeding 32K tokens. However, their performance evaluation has
largely been confined to metrics like perplexity and synthetic tasks, which may
not fully capture their abilities in more nuanced, real-world scenarios. This
study introduces a specialized benchmark (LIConBench) focusing on long
in-context learning within the realm of extreme-label classification. We
meticulously selected six datasets with label ranges spanning 28 to 174
classes, covering input (few-shot demonstration) lengths from 2K to 50K
tokens. Our benchmark requires LLMs to comprehend the entire input and
recognize the massive label space to make correct predictions. We evaluate 13
long-context LLMs on our benchmark. We find that the long-context LLMs perform
relatively well below a token length of 20K, and their performance benefits
from utilizing the long context window. However, once the context exceeds 20K
tokens, the performance of most LLMs, except GPT-4, drops dramatically. This suggests a notable gap
in current LLM capabilities for processing and understanding long, context-rich
sequences. Further analysis reveals a tendency among models to favor
predicting labels presented toward the end of the sequence; their ability
to reason over multiple pieces of information in the long sequence remains to be improved. Our
study reveals that long-context understanding and reasoning remain
challenging tasks for existing LLMs. We believe LIConBench can serve as a
more realistic evaluation for future long-context LLMs.
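To make the benchmark setup concrete, the sketch below shows one plausible way a long in-context learning prompt for extreme-label classification could be assembled from few-shot demonstrations and scored with exact-match accuracy. All names here (`build_prompt`, `query_llm`, the toy label set) are illustrative assumptions, not LIConBench's actual data format or evaluation code.

```python
# Hypothetical sketch of long in-context learning for extreme-label
# classification. Demonstrations are concatenated into a single long prompt,
# and the model must pick one label from a large label space.
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)

def build_prompt(demos: Sequence[Example], query_text: str, labels: Sequence[str]) -> str:
    """Concatenate few-shot demonstrations followed by the query instance.

    With tens of thousands of demonstration tokens and up to ~174 labels,
    this prompt is what stresses the model's long-context ability.
    """
    header = "Classify the input into one of the labels: " + ", ".join(labels) + "\n\n"
    demo_block = "\n".join(f"Input: {x}\nLabel: {y}\n" for x, y in demos)
    return header + demo_block + f"\nInput: {query_text}\nLabel:"

def accuracy(
    test_set: Sequence[Example],
    demos: Sequence[Example],
    labels: Sequence[str],
    query_llm: Callable[[str], str],  # assumed wrapper around any LLM API
) -> float:
    """Exact-match accuracy over the test set, one long prompt per query."""
    correct = 0
    for text, gold in test_set:
        prompt = build_prompt(demos, text, labels)
        prediction = query_llm(prompt).strip()
        correct += int(prediction == gold)
    return correct / max(len(test_set), 1)

# Toy usage with a stub "model" that always predicts the last demo's label,
# mimicking the end-of-sequence bias discussed in the abstract.
if __name__ == "__main__":
    labels = [f"label_{i}" for i in range(28)]
    demos = [(f"example text {i}", labels[i % len(labels)]) for i in range(56)]
    test = [("held-out text", "label_3")]
    stub = lambda prompt: demos[-1][1]
    print(accuracy(test, demos, labels, stub))
```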