ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

May 28, 2026
Authors: Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang
cs.AI

Abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
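
The abstract does not spell out how weighted criteria are combined or how the "LLM frontier mean" is computed. As a rough illustration only, the Python sketch below assumes that a task score is the weight-normalized average of per-criterion scores on a 0 to 100 scale, and that the frontier mean is the average over tasks of the best score any system achieved on that task. The names `Criterion`, `task_score`, and `frontier_mean`, the scale, and all numbers are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not the paper's schema)."""
    name: str
    weight: float  # relative importance assigned by the rubric author
    score: float   # grader's score for this item, assumed to lie in [0, 100]


def task_score(criteria: Iterable[Criterion]) -> float:
    """Weight-normalized average of criterion scores (assumed aggregation)."""
    items = list(criteria)
    total_weight = sum(c.weight for c in items)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    return sum(c.weight * c.score for c in items) / total_weight


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean of a non-empty collection."""
    vals = list(values)
    if not vals:
        raise ValueError("mean of empty collection")
    return sum(vals) / len(vals)


def frontier_mean(scores_by_system: Dict[str, List[float]]) -> float:
    """Mean over tasks of the best per-task score across systems.

    Assumes every system has one score per task, in the same task order.
    This reading of "frontier mean" is an assumption.
    """
    per_task_best = [max(task) for task in zip(*scores_by_system.values())]
    return mean(per_task_best)


# Toy rubric: one heavily weighted core result and two minor items.
toy_rubric = [
    Criterion("core finding reproduced", weight=3.0, score=50.0),
    Criterion("key figure matches", weight=1.0, score=0.0),
    Criterion("protocol matches", weight=1.0, score=100.0),
]
toy_task_score = task_score(toy_rubric)  # (150 + 0 + 100) / 5 = 50.0

# Toy per-task scores for two made-up systems on two tasks.
toy_scores = {"system_a": [30.0, 10.0], "system_b": [10.0, 20.0]}
toy_system_means = {name: mean(s) for name, s in toy_scores.items()}
toy_frontier = frontier_mean(toy_scores)  # (30 + 20) / 2 = 25.0
```

In this toy setup the two systems average 20.0 and 15.0, while the frontier mean is 25.0. Under the assumed definition, this is how a frontier mean such as the reported 26.5 can exceed the best single-LLM average of 20.7: different systems lead on different tasks.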