**$OneMillion-Bench: How Far are Language Agents from Human Experts?**
March 9, 2026
Authors: Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong
cs.AI
Abstract
As language models (LMs) evolve from chat assistants into long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making constrained decisions, where correctness depends as much on the reasoning process as on the final answer. We adopt a rubric-based evaluation protocol that scores factual accuracy, logical coherence, practical feasibility, and professional compliance, focusing on expert-level problems to ensure meaningful differentiation across agents. Together, $OneMillion-Bench provides a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain-intensive scenarios.
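The abstract names the four rubric dimensions but does not specify how per-dimension scores are combined into a task score. Below is a minimal Python sketch of one plausible aggregation, assuming a 0-1 score scale and equal weights; the dimension names mirror the abstract, while the scale, weights, and `RubricScore`/`aggregate` names are illustrative assumptions, not the benchmark's actual protocol.

```python
from dataclasses import dataclass

# The four rubric dimensions named in the abstract.
DIMENSIONS = ("factual_accuracy", "logical_coherence",
              "practical_feasibility", "professional_compliance")

@dataclass
class RubricScore:
    # Each field holds a per-dimension score; the 0-1 scale is an assumption.
    factual_accuracy: float
    logical_coherence: float
    practical_feasibility: float
    professional_compliance: float

    def aggregate(self, weights=None) -> float:
        """Weighted mean over the four dimensions (equal weights by default)."""
        weights = weights or {d: 0.25 for d in DIMENSIONS}
        return sum(getattr(self, d) * w for d, w in weights.items())

# Example: a response that is factually sound but weak on compliance.
score = RubricScore(0.9, 0.8, 0.7, 0.4)
print(f"aggregate rubric score: {score.aggregate():.2f}")  # -> 0.70
```

A weighted mean is only one option; since the benchmark stresses professional compliance, the actual protocol might instead gate on a minimum per-dimension threshold rather than averaging.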