Long-context LLMs Struggle with Long In-context Learning

Abstract

Large Language Models (LLMs) have made significant strides in handling long sequences exceeding 32K tokens. However, their evaluation has largely been confined to metrics such as perplexity and to synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LIConBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with label spaces ranging from 28 to 174 classes and input (few-shot demonstration) lengths ranging from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input in order to recognize the massive label space and make correct predictions. We evaluate 13 long-context LLMs on our benchmark. We find that long-context LLMs perform relatively well below a token length of 20K, and that their performance benefits from using the long context window. However, once the context window exceeds 20K, the performance of most LLMs except GPT-4 drops dramatically. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis revealed a tendency among models to favor predictions for labels presented towards the end of the sequence. Their ability to reason over multiple pieces of a long sequence has yet to improve. Our study reveals that long-context understanding and reasoning remain challenging for existing LLMs. We believe LIConBench could serve as a more realistic evaluation for future long-context LLMs.
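To make the task setup concrete, the following is a minimal sketch of how a long few-shot prompt for extreme-label classification might be assembled. It is an illustration under stated assumptions, not the benchmark's actual code: the function name `build_prompt`, the `Text:`/`Label:` template, and the idea of repeating the label set in "rounds" to lengthen the prompt are all hypothetical choices made here.

```python
import random


def build_prompt(demos, query, rounds=1, seed=0):
    """Assemble a few-shot classification prompt (hypothetical template).

    demos:  dict mapping each label to a non-empty list of example texts.
    query:  the text the model is asked to classify.
    rounds: how many passes over the full label set to include; more
            rounds means a longer prompt.
    """
    rng = random.Random(seed)
    blocks = []
    for r in range(rounds):
        labels = list(demos)
        # Shuffle labels in each round so that no label is always last.
        rng.shuffle(labels)
        for label in labels:
            examples = demos[label]
            text = examples[r % len(examples)]
            blocks.append(f"Text: {text}\nLabel: {label}")
    # The query goes last with an empty label slot for the model to fill.
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


def accuracy(predictions, golds):
    """Fraction of predictions that exactly match the gold labels."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)
```

With a label space of over a hundred classes, even one round yields a prompt of several thousand tokens, and every label appears only in a few places, so the model has to attend to the whole input rather than a local window.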
