
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

April 12, 2026
Authors: Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernandez Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernandez Montoya, Bing Liu
cs.AI

Abstract

Accelerating scientific discovery requires identifying which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcomes of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies range from 14% to 26%, and human expert performance is ~20%. Although some frontier models exceed human performance, model accuracy remains far below what would enable reliable experimental guidance. Even within this limited performance range, models fail to distinguish reliable predictions from unreliable ones, achieving only ~20% accuracy regardless of their confidence or of whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from ~5% to ~80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict