XpertBench: Expert-Level Tasks with Rubric-Based Evaluation
March 27, 2026
Authors: Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu
cs.AI
Abstract
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in the complex, open-ended tasks that characterize genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts, including researchers from elite institutions and practitioners with extensive clinical or industrial experience, ensuring superior ecological validity. Each task is graded against a detailed rubric, most comprising 15-40 weighted checkpoints that assess professional rigor. To enable scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
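
The abstract describes rubrics of 15-40 weighted checkpoints being aggregated into a per-task score. A minimal sketch of how such an aggregation could work is given below; the checkpoint criteria, weights, and the rubric_score helper are illustrative assumptions for exposition, not XpertBench's actual scoring implementation.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One rubric checkpoint: a criterion, its weight, and a judged score in [0, 1]."""
    criterion: str
    weight: float
    score: float  # degree to which the response satisfies the criterion

def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Aggregate weighted checkpoints into a single task score in [0, 1]."""
    total_weight = sum(c.weight for c in checkpoints)
    if total_weight == 0:
        return 0.0
    return sum(c.weight * c.score for c in checkpoints) / total_weight

# Hypothetical excerpt of a rubric (a real task would carry 15-40 such checkpoints).
example = [
    Checkpoint("Cites the correct regulatory clause", weight=3.0, score=1.0),
    Checkpoint("Quantifies downside risk with a numeric estimate", weight=2.0, score=0.5),
    Checkpoint("States limitations of the recommendation", weight=1.0, score=0.0),
]
print(f"task score: {rubric_score(example):.2f}")  # 0.67
```

In the ShotJudge setting, the per-checkpoint scores would come from an LLM judge prompted with expert-written few-shot exemplars rather than from self-assessment by the evaluated model, which is how the paper reports mitigating self-rewarding bias.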