\$OneMillion-Bench: How Far are Language Agents from Human Experts?

Abstract

As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce \$OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making decisions under constraints, where correctness depends as much on the reasoning process as on the final answer. We adopt a rubric-based evaluation protocol that scores factual accuracy, logical coherence, practical feasibility, and professional compliance, and we focus on expert-level problems to ensure meaningful differentiation across agents. Together, \$OneMillion-Bench provides a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain-intensive scenarios.
