XpertBench: Expert-Level Tasks with Rubric-Based Evaluation
March 27, 2026
Authors: Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu
cs.AI
Abstract
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in the complex, open-ended tasks that characterize genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts, including researchers from elite institutions and practitioners with extensive clinical or industrial experience, ensuring superior ecological validity. Each task is graded against a detailed rubric, most comprising 15-40 weighted checkpoints that assess professional rigor. To enable scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
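
The abstract describes rubrics of 15-40 weighted checkpoints being aggregated into a per-task score. A minimal sketch of how such an aggregation could work is given below; the checkpoint criteria, weights, and the rubric_score helper are illustrative assumptions for exposition, not XpertBench's actual scoring implementation.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One rubric checkpoint: a criterion, its weight, and a judged score in [0, 1]."""
    criterion: str
    weight: float
    score: float  # degree to which the response satisfies the criterion

def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Aggregate weighted checkpoints into a single task score in [0, 1]."""
    total_weight = sum(c.weight for c in checkpoints)
    if total_weight == 0:
        return 0.0
    return sum(c.weight * c.score for c in checkpoints) / total_weight

# Hypothetical excerpt of a rubric (a real task would carry 15-40 such checkpoints).
example = [
    Checkpoint("Cites the correct regulatory clause", weight=3.0, score=1.0),
    Checkpoint("Quantifies downside risk with a numeric estimate", weight=2.0, score=0.5),
    Checkpoint("States limitations of the recommendation", weight=1.0, score=0.0),
]
print(f"task score: {rubric_score(example):.2f}")  # 0.67
```

In the ShotJudge setting, the per-checkpoint scores would come from an LLM judge prompted with expert-written few-shot exemplars rather than from self-assessment by the evaluated model, which is how the paper reports mitigating self-rewarding bias.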