ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

May 28, 2026
Authors: Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang
cs.AI

Abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
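
The abstract does not spell out how the weighted criteria are combined into a score or how the "LLM frontier mean" is computed. The sketch below shows one plausible reading, not the benchmark's actual implementation: a task score is assumed to be the weight-normalized average of per-criterion scores on a 0-100 scale, and the frontier mean is assumed to be the average, over tasks, of the best score any system reached on that task. The names `Criterion`, `task_score`, `benchmark_mean`, and `frontier_mean`, the 0-100 scale, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not from the paper)."""
    description: str
    weight: float
    score: float  # assumed: fraction of the criterion judged satisfied, in [0, 1]


def task_score(criteria: list[Criterion]) -> float:
    """Weight-normalized average of criterion scores, scaled to 0-100 (assumed scale)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return 100.0 * sum(c.weight * c.score for c in criteria) / total_weight


def benchmark_mean(task_scores: list[float]) -> float:
    """Plain average of one system's task scores."""
    return sum(task_scores) / len(task_scores)


def frontier_mean(scores_by_system: dict[str, list[float]]) -> float:
    """Average over tasks of the best score any system achieved on that task.

    Assumes every system has one score per task, in the same task order.
    """
    per_task_best = [max(column) for column in zip(*scores_by_system.values())]
    return benchmark_mean(per_task_best)


# Made-up illustration data, not results from the paper.
rubric = [
    Criterion("reproduces the main figure", weight=3.0, score=0.5),
    Criterion("reports the key metric", weight=1.0, score=1.0),
]
scores = {"system_a": [20.0, 40.0], "system_b": [30.0, 10.0]}

print(task_score(rubric))                                  # 62.5
print({k: benchmark_mean(v) for k, v in scores.items()})   # per-system means
print(frontier_mean(scores))                               # 35.0
```

Under this reading, a frontier mean can exceed every individual system's mean, because different systems may do best on different tasks. That would be consistent with the reported frontier mean of 26.5 sitting above the best single-system averages of 21.5 and 20.7, though the paper's own definition should be checked.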