FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
February 2, 2026
Authors: Zhen Wang, Fan Bai, Zhongyan Luo, Jinyan Su, Kaiser Sun, Xinle Yu, Jieyuan Liu, Kun Zhou, Claire Cardie, Mark Dredze, Eric P. Xing, Zhiting Hu
cs.AI
Abstract
Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either rely heavily on LLM-as-judge evaluations of automatically generated research outputs, or they optimize convenient but isolated performance metrics that are only coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents on their ability to rediscover established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents built on frontier LLM backbones such as gpt-5 on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (<50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.