

BABE: Biology Arena BEnchmark

February 5, 2026
Authors: Junting Zhou, Jin Chen, Linfeng Hao, Denghui Cao, Zheyu Wang, Qiguang Chen, Chaoyou Fu, Jiaze Chen, Yuchen Wu, Ge Zhang, Mingxuan Wang, Wenhao Huang, Tong Yang
cs.AI

Abstract

The rapid evolution of large language models (LLMs) has expanded their capabilities from basic dialogue to advanced scientific reasoning. However, existing benchmarks in biology often fail to assess a critical skill required of researchers: the ability to integrate experimental results with contextual knowledge to derive meaningful conclusions. To address this gap, we introduce BABE (Biology Arena BEnchmark), a comprehensive benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. BABE is uniquely constructed from peer-reviewed research papers and real-world biological studies, ensuring that tasks reflect the complexity and interdisciplinary nature of actual scientific inquiry. BABE challenges models to perform causal reasoning and cross-scale inference. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists, offering a more authentic measure of their potential to contribute to biological research.