LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework
July 7, 2025
Authors: Zecheng Tang, Haitian Wang, Quantong Qiu, Baibei Ji, Ruoxi Sun, Keyan Zhou, Juntao Li, Min Zhang
cs.AI
Abstract
Long-context processing has become a fundamental capability for large language models (LLMs). To assess models' long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. Moreover, the high computational cost of long-context evaluation poses a significant barrier to comprehensive assessment of long-context models by the community. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports the deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: https://loomscope.github.io
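
To illustrate what "standardizing evaluation settings across diverse benchmarks" can mean in practice, the minimal sketch below shares a single configuration object across every benchmark run so that results remain comparable. This is a hypothetical, plain-Python illustration; the names (EvalConfig, run_benchmark, the benchmark list) are assumptions for exposition and do not reflect the actual LOOM-Scope API. See the project homepage above for the real framework.

# Hypothetical sketch: one shared evaluation configuration applied to several
# long-context benchmarks so their results stay comparable.
# These names are illustrative only and are NOT the actual LOOM-Scope API.
from dataclasses import dataclass

@dataclass
class EvalConfig:
    model_name: str              # model under evaluation
    max_context_length: int      # identical context limit for every benchmark
    temperature: float = 0.0     # greedy decoding for reproducible comparisons
    acceleration: str = "none"   # placeholder for an inference-acceleration method

def run_benchmark(benchmark: str, cfg: EvalConfig) -> dict:
    # Placeholder runner: a real framework would load data, run the model,
    # and score outputs; here we only echo the unified settings.
    return {"benchmark": benchmark, "model": cfg.model_name,
            "context": cfg.max_context_length, "score": None}

if __name__ == "__main__":
    cfg = EvalConfig(model_name="my-long-context-model", max_context_length=128_000)
    for bench in ["needle-in-a-haystack", "long-qa", "long-summarization"]:
        print(run_benchmark(bench, cfg))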