LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework
July 7, 2025
Authors: Zecheng Tang, Haitian Wang, Quantong Qiu, Baibei Ji, Ruoxi Sun, Keyan Zhou, Juntao Li, Min Zhang
cs.AI
Abstract
Long-context processing has become a fundamental capability for large language models (LLMs). To assess models' long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. Moreover, the high computational cost of long-context evaluation poses a significant barrier to comprehensive assessment of long-context models by the community. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports the deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: https://loomscope.github.io
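
To illustrate what "standardizing evaluation settings across diverse benchmarks" can mean in practice, the minimal sketch below shares a single configuration object across every benchmark run so that results remain comparable. This is a hypothetical, plain-Python illustration; the names (EvalConfig, run_benchmark, the benchmark list) are assumptions for exposition and do not reflect the actual LOOM-Scope API. See the project homepage above for the real framework.

# Hypothetical sketch: one shared evaluation configuration applied to several
# long-context benchmarks so their results stay comparable.
# These names are illustrative only and are NOT the actual LOOM-Scope API.
from dataclasses import dataclass

@dataclass
class EvalConfig:
    model_name: str              # model under evaluation
    max_context_length: int      # identical context limit for every benchmark
    temperature: float = 0.0     # greedy decoding for reproducible comparisons
    acceleration: str = "none"   # placeholder for an inference-acceleration method

def run_benchmark(benchmark: str, cfg: EvalConfig) -> dict:
    # Placeholder runner: a real framework would load data, run the model,
    # and score outputs; here we only echo the unified settings.
    return {"benchmark": benchmark, "model": cfg.model_name,
            "context": cfg.max_context_length, "score": None}

if __name__ == "__main__":
    cfg = EvalConfig(model_name="my-long-context-model", max_context_length=128_000)
    for bench in ["needle-in-a-haystack", "long-qa", "long-summarization"]:
        print(run_benchmark(bench, cfg))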