Evaluation-driven Scaling for Scientific Discovery
April 21, 2026
Authors: Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon, Tongyang Li, Xiaowen Chu, James Zou, Yuzhi Xu
cs.AI
Abstract
Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery; this paper seeks to address that problem. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. In particular, we accelerated the widely used LASSO algorithm by more than 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdős minimum-overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.
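To make the three ingredients the abstract names concrete, here is a minimal sketch of an evaluation-driven discovery loop: parallel exploration via independent "islands" of candidates, feedback-driven refinement of each island's current solution, and local selection that keeps a refinement only if the evaluator scores it higher. The `evaluate` and `propose_refinement` functions below are hypothetical stand-ins (a toy scoring function and a random perturbation) for the verifiers, simulators, or LLM proposers the paper actually uses; the loop structure, not these stubs, is the point.

```python
import random


def evaluate(candidate):
    # Toy scoring function standing in for a verifier/simulator:
    # score is higher the closer the candidate is to 100.
    return -abs(candidate - 100)


def propose_refinement(candidate, rng):
    # Stand-in for an LLM proposing a refined candidate from feedback.
    return candidate + rng.choice([-5, -1, 1, 5])


def discovery_loop(n_islands=4, n_rounds=50, seed=0):
    """Evaluation-driven scaling sketch:
    - parallel exploration: several islands start from different candidates;
    - feedback-driven refinement: each island proposes a new candidate;
    - local selection: an island adopts a proposal only if it scores better.
    """
    rng = random.Random(seed)
    islands = [rng.randint(0, 50) for _ in range(n_islands)]
    for _ in range(n_rounds):
        for i, cand in enumerate(islands):
            new = propose_refinement(cand, rng)
            if evaluate(new) > evaluate(cand):  # local selection
                islands[i] = new
    # Report the globally best candidate across islands.
    return max(islands, key=evaluate)
```

Because selection is local and monotone, each island's score never decreases, so scaling either axis (more islands, more rounds) can only help under this toy evaluator; choosing which axis to scale is exactly the question the paper studies.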