ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
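
The weighted-criteria scoring described above can be illustrated with a minimal sketch. This is not the benchmark's actual implementation. The `Criterion` fields, the per-criterion score range of 0 to 1, and the 0-100 scale are assumptions made for illustration. The `frontier_mean` function encodes one plausible reading of "LLM frontier mean", namely the average over tasks of the best score achieved by any model on that task; the abstract does not define the term.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item. Field names are hypothetical, not from the benchmark."""
    description: str
    weight: float  # relative importance assigned by the rubric author
    score: float   # grader's judgment for this item, assumed to lie in [0, 1]


def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted mean of per-criterion scores, scaled to 0-100 (assumed scale)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    return 100.0 * sum(c.weight * c.score for c in criteria) / total_weight


def frontier_mean(scores_by_model: dict[str, list[float]]) -> float:
    """Mean over tasks of the best score any model reached on that task.

    Assumed interpretation: every model has one score per task, in the
    same task order.
    """
    if not scores_by_model:
        raise ValueError("need at least one model")
    per_task = zip(*scores_by_model.values())
    best = [max(task_scores) for task_scores in per_task]
    return sum(best) / len(best)


# Illustrative numbers only; they are not results from the benchmark.
example = [
    Criterion("reproduces the main quantitative result", 2.0, 1.0),
    Criterion("uses the target experimental protocol", 1.0, 0.0),
    Criterion("supports claims with matching evidence", 1.0, 0.5),
]
print(rubric_score(example))                                    # 62.5
print(frontier_mean({"m1": [10.0, 30.0], "m2": [20.0, 10.0]}))  # 25.0
```

Under this reading, a frontier mean can exceed every individual model's average, because different models may lead on different tasks. In the toy example the two models average 20.0 and 15.0, while the frontier mean is 25.0.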