Forecasting Scientific Progress with Artificial Intelligence

Abstract

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multidisciplinary, event-level benchmark that evaluates the scientific forecasting ability of AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic, domain-dependent limitations in current frontier models. Although models can identify plausible research directions among competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains: the timing of advances in AI is more predictable than that of advances in biology, chemistry, and physics. It is also largely insensitive to whether events occur before or after the training cutoff, suggesting that these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, and this gap is most pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress: access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.