ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
June 2, 2025
Authors: Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang
cs.AI
Abstract
This paper introduces ExpertLongBench, an expert-level benchmark containing
11 tasks from 9 domains that reflect realistic expert workflows and
applications. Beyond question answering, the application-driven tasks in
ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and
strict adherence to domain-specific requirements. Notably, each task in
ExpertLongBench includes a rubric, designed or validated by domain experts, to
specify task requirements and guide output evaluation. Furthermore, we propose
CLEAR, an evaluation framework that supports accurate assessment of long-form
model outputs in our benchmark. To achieve fine-grained, expert-aligned
evaluation, CLEAR derives checklists from both model outputs and references by
extracting information corresponding to items in the task-specific rubric.
Checklist items for model outputs are then compared with corresponding items
for reference outputs to assess their correctness, enabling grounded
evaluation. We benchmark 11 large language models (LLMs) and analyze components
in CLEAR, showing that (1) existing LLMs require significant improvement for
expert-level tasks, with the top performer achieving only a 26.8% F1 score; (2)
models can generate content corresponding to the required aspects, though often
not accurately; and (3) accurate checklist extraction and comparison in CLEAR
can be achieved by open-weight models for more scalable and low-cost usage.
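To make CLEAR's checklist-based scoring concrete, here is a minimal Python sketch of the final comparison-and-scoring step. The data structure and function names (ChecklistItem, items_match, checklist_f1) are illustrative assumptions rather than the paper's released code, and the exact string match is only a placeholder for the LLM-based comparison CLEAR actually performs; what follows the abstract's description is the precision/recall/F1 bookkeeping over rubric items.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ChecklistItem:
    """One rubric item extracted from a long-form output (hypothetical structure)."""
    rubric_id: str          # which rubric item this entry answers
    content: Optional[str]  # extracted text, or None if the output omits the aspect


def items_match(output_item: ChecklistItem, reference_item: ChecklistItem) -> bool:
    """Judge whether the model's extracted content matches the reference.

    In CLEAR this judgment is made by a language model; the exact string
    match below is only a stand-in to keep the scoring logic concrete.
    """
    if output_item.content is None or reference_item.content is None:
        return False
    return output_item.content.strip().lower() == reference_item.content.strip().lower()


def checklist_f1(
    output_checklist: List[ChecklistItem],
    reference_checklist: List[ChecklistItem],
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 over rubric items.

    Precision: of the rubric items the model output fills in, how many match
    the reference. Recall: of the items the reference fills in, how many the
    model output matches.
    """
    ref_by_id = {item.rubric_id: item for item in reference_checklist}

    predicted = sum(1 for item in output_checklist if item.content is not None)
    gold = sum(1 for item in reference_checklist if item.content is not None)

    correct = 0
    for item in output_checklist:
        ref = ref_by_id.get(item.rubric_id)
        if ref is not None and items_match(item, ref):
            correct += 1

    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

In this reading, a model that writes about every required aspect but gets the details wrong would score well on coverage but poorly on the matched-item F1, which is the gap the abstract highlights in finding (2).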