ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
June 2, 2025
Authors: Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang
cs.AI
Abstract
This paper introduces ExpertLongBench, an expert-level benchmark containing
11 tasks from 9 domains that reflect realistic expert workflows and
applications. Beyond question answering, the application-driven tasks in
ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and
strict adherence to domain-specific requirements. Notably, each task in
ExpertLongBench includes a rubric, designed or validated by domain experts, to
specify task requirements and guide output evaluation. Furthermore, we propose
CLEAR, an evaluation framework that supports accurate assessment of long-form
model outputs in our benchmark. To achieve fine-grained, expert-aligned
evaluation, CLEAR derives checklists from both model outputs and references by
extracting information corresponding to items in the task-specific rubric.
Checklist items for model outputs are then compared with corresponding items
for reference outputs to assess their correctness, enabling grounded
evaluation. We benchmark 11 large language models (LLMs) and analyze components
in CLEAR, showing that (1) existing LLMs require significant improvement for
expert-level tasks, with the top performer achieving only a 26.8% F1 score; (2)
models can generate content corresponding to the required aspects, though often
not accurately; and (3) accurate checklist extraction and comparison in CLEAR
can be achieved by open-weight models for more scalable and low-cost usage.
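To make CLEAR's checklist-based scoring concrete, here is a minimal Python sketch of the final comparison-and-scoring step. The data structure and function names (ChecklistItem, items_match, checklist_f1) are illustrative assumptions rather than the paper's released code, and the exact string match is only a placeholder for the LLM-based comparison CLEAR actually performs; what follows the abstract's description is the precision/recall/F1 bookkeeping over rubric items.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ChecklistItem:
    """One rubric item extracted from a long-form output (hypothetical structure)."""
    rubric_id: str          # which rubric item this entry answers
    content: Optional[str]  # extracted text, or None if the output omits the aspect


def items_match(output_item: ChecklistItem, reference_item: ChecklistItem) -> bool:
    """Judge whether the model's extracted content matches the reference.

    In CLEAR this judgment is made by a language model; the exact string
    match below is only a stand-in to keep the scoring logic concrete.
    """
    if output_item.content is None or reference_item.content is None:
        return False
    return output_item.content.strip().lower() == reference_item.content.strip().lower()


def checklist_f1(
    output_checklist: List[ChecklistItem],
    reference_checklist: List[ChecklistItem],
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 over rubric items.

    Precision: of the rubric items the model output fills in, how many match
    the reference. Recall: of the items the reference fills in, how many the
    model output matches.
    """
    ref_by_id = {item.rubric_id: item for item in reference_checklist}

    predicted = sum(1 for item in output_checklist if item.content is not None)
    gold = sum(1 for item in reference_checklist if item.content is not None)

    correct = 0
    for item in output_checklist:
        ref = ref_by_id.get(item.rubric_id)
        if ref is not None and items_match(item, ref):
            correct += 1

    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

In this reading, a model that writes about every required aspect but gets the details wrong would score well on coverage but poorly on the matched-item F1, which is the gap the abstract highlights in finding (2).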