SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Abstract

Accelerating scientific discovery requires identifying which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes, a task where AI could significantly exceed human capabilities, remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcomes of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies range from 14% to 26%, and human expert performance is approximately 20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within this limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only approximately 20% accuracy regardless of their confidence or of whether they judge outcomes to be predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from approximately 5% to approximately 80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict
