SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Abstract

Accelerating scientific discovery requires identifying which experiments will yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes, a task where AI could significantly exceed human capabilities, remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies range from 14% to 26%, and human expert performance is approximately 20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within this limited performance range, models fail to distinguish reliable predictions from unreliable ones, achieving only approximately 20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from approximately 5% to approximately 80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. All data and code are available at https://github.com/scaleapi/scipredict for reproducibility.
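The calibration comparison described above amounts to grouping predictions by how predictable the predictor judged the outcome to be, then comparing accuracy across groups. The following is a minimal sketch of that idea, not the benchmark's actual evaluation code. The function names, the bucket labels, and the toy records are all hypothetical and are not taken from the SciPredict repository or its data.

```python
from collections import defaultdict


def accuracy_by_bucket(records):
    """Return {bucket: accuracy} for an iterable of (bucket, is_correct) pairs.

    `bucket` is any hashable label for the predictor's self-rated
    predictability or confidence (hypothetical labels here).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for bucket, is_correct in records:
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def calibration_gap(bucket_accuracy):
    """Spread between the most and least accurate bucket.

    A gap near zero means self-rated predictability carries little
    information about correctness.
    """
    values = list(bucket_accuracy.values())
    return max(values) - min(values)


# Hypothetical toy records for illustration only, NOT results from the paper.
toy_records = [
    ("low", False), ("low", False), ("low", True), ("low", False),
    ("high", True), ("high", True), ("high", True), ("high", False),
]

toy_accuracy = accuracy_by_bucket(toy_records)
toy_gap = calibration_gap(toy_accuracy)
print(toy_accuracy, toy_gap)
```

Under this reading, a well-calibrated predictor shows accuracy rising with self-rated predictability, while a flat profile across buckets corresponds to the model behavior the abstract reports.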

