ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
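
To make the scoring arithmetic concrete, the following is a minimal sketch of how weighted rubric criteria could be aggregated into a task score, and of one plausible reading of the "LLM frontier mean" (the mean over tasks of the best score any LLM achieves on each task), which would explain why it can exceed the average of the strongest single model. The abstract does not specify either formula; the 0-100 scale, the normalization by total weight, and the names `Criterion`, `weighted_rubric_score`, and `frontier_mean` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    name: str
    weight: float  # relative importance assigned by the rubric author
    score: float   # judged score, assumed here to lie on a 0-100 scale


def weighted_rubric_score(criteria: list[Criterion]) -> float:
    """Weight-normalized average of criterion scores (assumed aggregation)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric must have positive total weight")
    return sum(c.weight * c.score for c in criteria) / total_weight


def frontier_mean(scores_by_model: dict[str, list[float]]) -> float:
    """Mean over tasks of the best per-task score across models.

    Assumes every model has one score per task, in the same task order.
    This is only one possible interpretation of "frontier mean".
    """
    per_task_best = [max(task) for task in zip(*scores_by_model.values())]
    return sum(per_task_best) / len(per_task_best)


# Illustrative numbers only; they are not results from the benchmark.
rubric = [
    Criterion("core finding reproduced", weight=2.0, score=50.0),
    Criterion("key figure matches", weight=1.0, score=0.0),
    Criterion("protocol matches", weight=1.0, score=20.0),
]
task_score = weighted_rubric_score(rubric)

toy_scores = {"model_a": [30.0, 10.0], "model_b": [10.0, 20.0]}
toy_frontier = frontier_mean(toy_scores)
```

Under this reading, the toy frontier mean (25.0) is higher than either model's own average (20.0 and 15.0), because different models are strongest on different tasks.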