\$OneMillion-Bench: How Far are Language Agents from Human Experts?

Abstract

As language models (LMs) evolve from chat assistants into long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To address this gap, we introduce **\$OneMillion-Bench**, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents in economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making decisions under constraints, where correctness depends as much on the reasoning process as on the final answer. We adopt a rubric-based evaluation protocol that scores factual accuracy, logical coherence, practical feasibility, and professional compliance, and we focus on expert-level problems to ensure meaningful differentiation across agents. Together, \$OneMillion-Bench provides a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain-intensive scenarios.
