LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework

Abstract

Long-context processing has become a fundamental capability of large language models (LLMs). Numerous benchmarks have been proposed to assess models' long-context performance. However, differences in evaluation settings across these benchmarks lead to inconsistent results, making reliable comparisons difficult. In addition, the high computational cost of long-context evaluation is a significant barrier that keeps the community from assessing long-context models comprehensively. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports the deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite for evaluating models comprehensively. Homepage: https://loomscope.github.io
