

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

February 19, 2024
Authors: Mosh Levy, Alon Jacoby, Yoav Goldberg
cs.AI

Abstract

This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite recent advancements in LLMs, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs' reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that traditional perplexity metrics do not correlate with the performance of LLMs in long-input reasoning tasks. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
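To make the padding setup concrete, the sketch below shows one way to build multiple versions of the same QA sample, each extended to a different target length and with the task-relevant paragraph placed at a different position. This is an illustrative assumption, not the authors' released code: the filler sentence, the word-count targets, the position scheme, and the function name build_padded_versions are all hypothetical.

```python
# Illustrative sketch only: one plausible way to extend a QA sample with
# padding of varying length and location, keeping the task itself unchanged.
FILLER = "This sentence is neutral padding and is unrelated to the question. "

def build_padded_versions(relevant_paragraph, question,
                          target_word_counts=(250, 500, 1000, 2000),
                          positions=("start", "middle", "end")):
    """Return several versions of the same QA sample, each padded to a
    different target length with the relevant paragraph at a different
    location in the input."""
    versions = []
    relevant_words = relevant_paragraph.split()
    filler_words = FILLER.split()
    for target in target_word_counts:
        need = max(0, target - len(relevant_words))
        # Repeat the filler sentence until we have enough padding words.
        padding = (filler_words * (need // len(filler_words) + 1))[:need]
        pad_text = " ".join(padding)
        for pos in positions:
            if pos == "start":      # relevant paragraph first, padding after
                context = f"{relevant_paragraph} {pad_text}"
            elif pos == "end":      # padding first, relevant paragraph last
                context = f"{pad_text} {relevant_paragraph}"
            else:                   # relevant paragraph buried in the middle
                half = need // 2
                context = (" ".join(padding[:half]) + f" {relevant_paragraph} "
                           + " ".join(padding[half:]))
            versions.append({
                "target_words": target,
                "position": pos,
                "prompt": f"{context.strip()}\n\nQuestion: {question}\nAnswer:",
            })
    return versions
```

Each returned prompt poses the same task, so accuracy differences across versions can be attributed to input length and to where the relevant text sits, which is the isolation the abstract describes.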