ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

Abstract

This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, that specifies task requirements and guides output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items for model outputs are then compared with the corresponding items for reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 11 large language models (LLMs) and analyze the components of CLEAR, showing that (1) existing LLMs require significant improvement for expert-level tasks, with the top performer achieving only a 26.8% F1 score; (2) models can generate content corresponding to the required aspects, though often not accurately; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models, enabling more scalable and lower-cost use.
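The checklist comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define how the F1 score is computed, so the precision and recall definitions below are assumptions, as are the names `score_checklists` and `exact_match` and the example rubric items. In CLEAR the comparison is performed by a language model; a simple string match stands in for that judge here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# A checklist maps each rubric item to the text extracted for it,
# or None when the output says nothing about that item.
Checklist = Dict[str, Optional[str]]


@dataclass
class Score:
    precision: float
    recall: float
    f1: float


def exact_match(model_text: str, reference_text: str) -> bool:
    """Stand-in judge (assumption): case-insensitive exact match."""
    return model_text.strip().lower() == reference_text.strip().lower()


def score_checklists(
    model: Checklist,
    reference: Checklist,
    is_correct: Callable[[str, str], bool] = exact_match,
) -> Score:
    """Compare model checklist items with the corresponding reference items.

    Assumed definitions (not taken from the paper):
      precision = correct items / items the model filled in
      recall    = correct items / items the reference filled in
    """
    model_filled = [item for item, text in model.items() if text]
    reference_filled = [item for item, text in reference.items() if text]

    # An item counts as correct only if both sides filled it in
    # and the judge accepts the model's text.
    correct = sum(
        1
        for item in reference_filled
        if model.get(item) and is_correct(model[item], reference[item])
    )

    precision = correct / len(model_filled) if model_filled else 0.0
    recall = correct / len(reference_filled) if reference_filled else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Score(precision, recall, f1)


# Hypothetical rubric items and extracted texts, for illustration only.
model_checklist = {
    "diagnosis": "Type 2 diabetes",
    "treatment_plan": "Insulin",
    "follow_up": None,
}
reference_checklist = {
    "diagnosis": "type 2 diabetes",
    "treatment_plan": "Metformin",
    "follow_up": "Review in 3 months",
}

print(score_checklists(model_checklist, reference_checklist))
```

In this example the model fills in two of the three items and gets one right, which mirrors finding (2) of the abstract: content for a required aspect can be present without being accurate.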
