xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations
June 16, 2025
Authors: Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu, Zihan Yin, Zijian Ma, Zhiwen Mo
cs.AI
Abstract
We introduce xbench, a dynamic, profession-aligned evaluation suite designed
to bridge the gap between AI agent capabilities and real-world productivity.
While existing benchmarks often focus on isolated technical skills, they may
not accurately reflect the economic value agents deliver in professional
settings. To address this, xbench targets commercially significant domains with
evaluation tasks defined by industry professionals. Our framework creates
metrics that strongly correlate with productivity value, enables prediction of
Technology-Market Fit (TMF), and facilitates tracking of product capabilities
over time. As our initial implementations, we present two benchmarks:
Recruitment and Marketing. For Recruitment, we collect 50 tasks from real-world
headhunting business scenarios to evaluate agents' abilities in company
mapping, information retrieval, and talent sourcing. For Marketing, we assess
agents' ability to match influencers with advertiser needs, evaluating their
performance across 50 advertiser requirements using a curated pool of 836
candidate influencers. We present initial evaluation results for leading
contemporary agents, establishing a baseline for these professional domains.
Our continuously updated evalsets and evaluations are available at
https://xbench.org.