
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

June 2, 2025
Authors: Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang
cs.AI

Abstract

This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items for model outputs are then compared with corresponding items for reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 11 large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer achieving only a 26.8% F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, though often not accurately; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable and low-cost usage.
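
The abstract describes CLEAR only at a high level. The sketch below, in Python, illustrates the general idea of rubric-based checklist extraction and grounded item-by-item comparison followed by an F1 score over rubric items. All names here (RubricItem, extract_checklist, compare_item, checklist_f1, and the llm callable) are hypothetical illustrations under assumed interfaces, not the authors' released code; the actual extraction prompts, matching criteria, and scoring in CLEAR may differ.

    # Minimal, hypothetical sketch of a CLEAR-style checklist evaluation.
    # Assumes `llm` is any callable that maps a prompt string to a text completion.
    from dataclasses import dataclass

    @dataclass
    class RubricItem:
        name: str       # e.g., a required aspect of the task output
        question: str   # instruction used to extract this aspect from a document

    def extract_checklist(document: str, rubric: list[RubricItem], llm) -> dict[str, str]:
        """Map each rubric item to the information found in the document ('N/A' if absent)."""
        checklist = {}
        for item in rubric:
            prompt = (
                f"{item.question}\n\nDocument:\n{document}\n\n"
                "Answer concisely, or reply 'N/A' if the document does not cover this."
            )
            checklist[item.name] = llm(prompt).strip()
        return checklist

    def compare_item(model_answer: str, reference_answer: str, llm) -> bool:
        """Ask a judge model whether the extracted item matches the reference item."""
        prompt = (
            "Do these two statements convey the same information? Answer YES or NO.\n"
            f"Candidate: {model_answer}\nReference: {reference_answer}"
        )
        return llm(prompt).strip().upper().startswith("YES")

    def checklist_f1(model_cl: dict[str, str], ref_cl: dict[str, str], llm) -> float:
        """F1 over rubric items: precision and recall of correctly covered items."""
        covered = [k for k, v in model_cl.items() if v != "N/A"]
        required = [k for k, v in ref_cl.items() if v != "N/A"]
        correct = [k for k in covered
                   if k in required and compare_item(model_cl[k], ref_cl[k], llm)]
        precision = len(correct) / len(covered) if covered else 0.0
        recall = len(correct) / len(required) if required else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

In this reading, a model that mentions every required aspect but gets the details wrong would show high coverage yet a low F1, which matches the paper's finding that models produce content for the required aspects but often not accurately.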