
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

April 12, 2026
Authors: Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernandez Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernandez Montoya, Bing Liu
cs.AI

Abstract

Accelerating scientific discovery requires identifying which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcomes of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies range from 14% to 26%, while human expert performance is approximately 20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within this limited performance range, models fail to distinguish reliable predictions from unreliable ones, achieving only about 20% accuracy regardless of their stated confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from roughly 5% to roughly 80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all data and code are provided at https://github.com/scaleapi/scipredict.