LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework

Abstract

Long-context processing has become a fundamental capability for large language models (LLMs). To assess a model's long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making reliable comparison difficult. In addition, the high computational cost of long-context evaluation is a significant barrier to comprehensive assessment of long-context models by the community. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: https://loomscope.github.io
