DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

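Abstract

Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows spanning literature synthesis, methodological design, and empirical verification. Despite this progress, faithfully evaluating their research capability remains challenging because it is difficult to collect frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars. The benchmark captures rich expert discourse and interaction, which better reflects real-world research environments and reduces the risk of data leakage. To construct DeepResearch Arena automatically, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system then translates these inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks drawn from over 200 academic seminars spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.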

