xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations

Abstract

We introduce xbench, a dynamic, profession-aligned evaluation suite designed to bridge the gap between AI agent capabilities and real-world productivity. Existing benchmarks often focus on isolated technical skills and may not accurately reflect the economic value agents deliver in professional settings. To address this, xbench targets commercially significant domains with evaluation tasks defined by industry professionals. Our framework creates metrics that strongly correlate with productivity value, enables prediction of Technology-Market Fit (TMF), and facilitates tracking of product capabilities over time. As our initial implementations, we present two benchmarks: Recruitment and Marketing. For Recruitment, we collect 50 tasks from real-world headhunting business scenarios to evaluate agents' abilities in company mapping, information retrieval, and talent sourcing. For Marketing, we assess agents' ability to match influencers with advertiser needs, evaluating their performance across 50 advertiser requirements using a curated pool of 836 candidate influencers. We present initial evaluation results for leading contemporary agents, establishing a baseline for these professional domains. Our continuously updated evaluation sets and results are available at https://xbench.org.

