Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Abstract

This paper explores how extending input length affects the capabilities of Large Language Models (LLMs). Despite recent advances in LLMs, the consistency of their performance across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length by using multiple versions of the same sample, each extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs' reasoning performance at input lengths much shorter than their technical maximum. The degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that traditional perplexity metrics do not correlate with LLMs' performance on long-input reasoning tasks. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
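The idea of building several versions of one sample, each padded to a different length and with the padding in a different location, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `pad_sample` and `make_versions`, the position labels, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def pad_sample(key_text: str, filler_words: List[str], target_len: int,
               padding_position: str = "after") -> str:
    """Extend key_text with filler until it has target_len words.

    Words (whitespace-separated) are a hypothetical stand-in for tokens.
    """
    key_words = key_text.split()
    n_pad = max(0, target_len - len(key_words))
    if n_pad > 0 and not filler_words:
        raise ValueError("filler_words must be non-empty when padding is needed")

    # Repeat the filler as often as needed, then cut to the exact length.
    padding: List[str] = []
    if n_pad > 0:
        reps = n_pad // len(filler_words) + 1
        padding = (filler_words * reps)[:n_pad]

    if padding_position == "before":
        words = padding + key_words
    elif padding_position == "after":
        words = key_words + padding
    elif padding_position == "around":
        half = n_pad // 2
        words = padding[:half] + key_words + padding[half:]
    else:
        raise ValueError(f"unknown padding_position: {padding_position}")
    return " ".join(words)


def make_versions(key_text: str, filler_words: List[str],
                  lengths: List[int],
                  positions: List[str]) -> Dict[Tuple[int, str], str]:
    """Build one padded version of the same sample per (length, position)."""
    return {
        (n, p): pad_sample(key_text, filler_words, n, p)
        for n in lengths
        for p in positions
    }


versions = make_versions(
    "A is left of B . B is left of C .",  # the text the question depends on
    ["filler"],                           # padding type (here: a repeated word)
    lengths=[20, 50],
    positions=["before", "after", "around"],
)
```

Because every version contains the same key text and the same question, any change in accuracy across versions can be attributed to the added padding rather than to the task itself.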

