

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

October 21, 2025
作者: Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
cs.AI

Abstract

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7,000 response-criterion pairs evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, making it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench
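The abstract describes rubric-based evaluation in which an LLM judge decides, for each response-criterion pair, whether the criterion is satisfied, and overall performance is the fraction of criteria met. The snippet below is a minimal sketch of that workflow, not the authors' implementation: the dataset field names ("response", "criterion"), the split name, and the judge model are assumptions made for illustration; consult the dataset card and the GitHub repository above for the actual schema and judge templates.

```python
# Minimal sketch of rubric-style judging in the spirit of ProfBench.
# Assumptions (not from the paper): the dataset exposes "response" and
# "criterion" fields and a "train" split, and the judge prompt/model below
# are placeholders -- see https://github.com/NVlabs/ProfBench for the
# actual schema and judge templates.
from datasets import load_dataset
from openai import OpenAI

ds = load_dataset("nvidia/ProfBench", split="train")  # split name is an assumption
client = OpenAI()  # any OpenAI-compatible endpoint can serve as the judge


def judge(response: str, criterion: str) -> bool:
    """Ask an LLM judge whether a response satisfies a single rubric criterion."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": (
                "Does the following response satisfy the criterion?\n"
                f"Criterion: {criterion}\n\nResponse: {response}\n\n"
                "Answer with exactly YES or NO."
            ),
        }],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")


# Score a handful of response-criterion pairs; overall performance is the
# fraction of criteria judged as satisfied.
sample = ds.select(range(5))
hits = sum(judge(row["response"], row["criterion"]) for row in sample)
print(f"Satisfied {hits}/{len(sample)} criteria")
```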