FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
February 2, 2026
Authors: Zhen Wang, Fan Bai, Zhongyan Luo, Jinyan Su, Kaiser Sun, Xinle Yu, Jieyuan Liu, Kun Zhou, Claire Cardie, Mark Dredze, Eric P. Xing, Zhiting Hu
cs.AI
Abstract
Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either rely heavily on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that serve as coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents built on frontier LLM backbones, such as GPT-5, on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (F1 < 50), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.
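The abstract reports rediscovery success as an F1 score but does not spell out the scoring protocol here. As an illustration only, the sketch below shows one plausible way a finding-level F1 could be computed: each conclusion an agent reports is matched against the paper's verified findings by a judge function, with precision and recall taken over the matched sets. The `finding_f1` function and the `matches` callable are hypothetical, not FIRE-Bench's actual implementation.

```python
from typing import Callable, Sequence

def finding_f1(
    gold: Sequence[str],
    predicted: Sequence[str],
    matches: Callable[[str, str], bool],
) -> float:
    """Finding-level F1 under a greedy one-to-one matching.

    `matches(pred, gold)` is a hypothetical judge deciding whether a
    reported conclusion restates a verified finding (e.g., an LLM judge
    or human annotator); an assumption, not the paper's stated protocol.
    """
    if not gold or not predicted:
        return 0.0
    unmatched = list(gold)
    hits = 0
    for pred in predicted:
        for g in unmatched:
            if matches(pred, g):
                hits += 1
                unmatched.remove(g)  # each gold finding credited at most once
                break
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)


# Toy usage with a trivial exact-match judge (illustrative only):
gold = ["scaling data improves robustness", "dropout hurts calibration"]
pred = ["scaling data improves robustness"]
print(finding_f1(gold, pred, lambda p, g: p == g))  # 2*1.0*0.5/1.5 = 0.667
```

On this reading, "F1 < 50" would mean that even the best agents recover well under half of the verified findings once false and missed rediscoveries are both penalized.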