ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
October 21, 2025
Authors: Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
cs.AI
Abstract
Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, making it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks.

Data: https://huggingface.co/datasets/nvidia/ProfBench
Code: https://github.com/NVlabs/ProfBench