xbench: Tracking Agent Productivity Scaling with Profession-Aligned Real-World Evaluations

Abstract

We introduce xbench, a dynamic, profession-aligned evaluation suite designed to bridge the gap between AI agent capabilities and real-world productivity. While existing benchmarks often focus on isolated technical skills, they may not accurately reflect the economic value agents deliver in professional settings. To address this, xbench targets commercially significant domains with evaluation tasks defined by industry professionals. Our framework creates metrics that strongly correlate with productivity value, enables prediction of Technology-Market Fit (TMF), and facilitates tracking of product capabilities over time. As our initial implementations, we present two benchmarks: Recruitment and Marketing. For Recruitment, we collect 50 tasks from real-world headhunting business scenarios to evaluate agents' abilities in company mapping, information retrieval, and talent sourcing. For Marketing, we assess agents' ability to match influencers with advertiser needs, evaluating their performance across 50 advertiser requirements using a curated pool of 836 candidate influencers. We present initial evaluation results for leading contemporary agents, establishing a baseline for these professional domains. Our continuously updated evalsets and evaluations are available at https://xbench.org.