

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

February 19, 2024
作者: Mosh Levy, Alon Jacoby, Yoav Goldberg
cs.AI

Abstract

This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite recent advancements in LLMs, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each extended with padding of different lengths, types, and locations. Our findings show a notable degradation in LLMs' reasoning performance at input lengths much shorter than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that traditional perplexity metrics do not correlate with LLMs' performance in long-input reasoning tasks. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
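The abstract describes isolating input length by padding the same QA sample at different lengths, types, and positions. A minimal sketch of that idea, assuming a whitespace token count and hypothetical helper names (the paper's actual tokenization and padding sources are not specified here):

```python
# Hypothetical sketch: embed one QA sample in prompts of growing length by
# repeating filler text at a chosen position, so only input length varies.

def build_padded_prompt(question: str, context: str, filler: str,
                        target_tokens: int, position: str = "before") -> str:
    """Repeat `filler` until the prompt reaches roughly `target_tokens`
    whitespace-delimited tokens, placing padding before or after the context."""
    def n_tokens(text: str) -> int:
        return len(text.split())

    base = f"{context}\n\nQuestion: {question}"
    padding_words: list[str] = []
    while n_tokens(base) + len(padding_words) < target_tokens:
        padding_words.extend(filler.split())
    padding = " ".join(padding_words)

    if position == "before":
        return f"{padding}\n\n{base}"
    return f"{context}\n\n{padding}\n\nQuestion: {question}"

# Three length versions of the same sample, identical task throughout.
variants = [build_padded_prompt("Who wrote it?", "The book was written by A.",
                                "Lorem ipsum dolor sit amet.", n)
            for n in (50, 500, 1500)]
```

Evaluating the model on each variant and comparing accuracy across lengths is what separates a genuine length effect from a change in task content.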