$OneMillion-Bench: How Far are Language Agents from Human Experts?
March 9, 2026
Authors: Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong
cs.AI
Abstract
As language models (LMs) evolve from chat assistants into long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To close this gap, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making decisions under constraints, so correctness depends as much on the reasoning process as on the final answer. We adopt a rubric-based evaluation protocol that scores factual accuracy, logical coherence, practical feasibility, and professional compliance, and we focus on expert-level problems to ensure meaningful differentiation across agents. $OneMillion-Bench thus provides a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain-intensive scenarios.
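To make the rubric-based protocol concrete, here is a minimal sketch in Python of how per-dimension scores along the four named axes might be aggregated into a single task score. Only the dimension names come from the abstract; the 0-1 score scale, the uniform default weights, and the weighted-mean aggregation are illustrative assumptions, not the benchmark's actual grading scheme.

```python
from dataclasses import dataclass

# Rubric dimensions mirroring the four axes named in the abstract.
DIMENSIONS = (
    "factual_accuracy",
    "logical_coherence",
    "practical_feasibility",
    "professional_compliance",
)

@dataclass
class RubricScore:
    """Per-dimension scores in [0, 1], e.g. as assigned by expert graders."""
    factual_accuracy: float
    logical_coherence: float
    practical_feasibility: float
    professional_compliance: float

    def aggregate(self, weights: dict[str, float] | None = None) -> float:
        """Weighted mean over the four dimensions (uniform by default)."""
        weights = weights or {d: 1.0 for d in DIMENSIONS}
        total = sum(weights[d] for d in DIMENSIONS)
        return sum(getattr(self, d) * weights[d] for d in DIMENSIONS) / total

# Example: a response that is factually sound but weak on compliance.
score = RubricScore(
    factual_accuracy=0.9,
    logical_coherence=0.8,
    practical_feasibility=0.7,
    professional_compliance=0.4,
)
print(f"aggregate score: {score.aggregate():.2f}")  # 0.70
```

A weighted mean keeps the dimensions separable for error analysis; an actual protocol might instead gate on hard requirements, for example zeroing a task's score on any professional-compliance violation.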