Forecasting Scientific Progress with Artificial Intelligence

Abstract

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary, event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions among competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than that of advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after a model's training cutoff, suggesting that these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, and this gap is more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.
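To make the notion of overconfidence concrete, the following is a minimal sketch of two standard calibration measures, the Brier score and the expected calibration error (ECE), applied to binary forecasts of whether an advance is realized. The abstract does not state which calibration metrics CUSP uses, so the choice of metrics, the function names, and the equal-width binning are assumptions made for illustration only.

```python
from typing import Sequence


def _check(probs: Sequence[float], outcomes: Sequence[int]) -> None:
    # Both sequences must be non-empty and aligned one-to-one.
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    _check(probs, outcomes)
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Bin-size-weighted gap between mean predicted probability and observed frequency.

    Uses equal-width bins on [0, 1]; this binning scheme is an illustrative
    assumption, not something taken from the benchmark.
    """
    _check(probs, outcomes)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_prob - observed)
    return ece


if __name__ == "__main__":
    # Hypothetical forecasts: 90% confidence on four events, two of which occurred.
    forecasts = [0.9, 0.9, 0.9, 0.9]
    realized = [1, 0, 0, 1]
    print(brier_score(forecasts, realized))
    print(expected_calibration_error(forecasts, realized))
```

In this hypothetical example the forecaster states 90% confidence while only half of the events occur, so the ECE reflects a gap of roughly 0.4 between stated confidence and observed frequency, which is the pattern the term overconfidence describes.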