Evaluation-driven Scaling for Scientific Discovery

Abstract

Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how to scale evaluation-driven discovery loops in a principled and effective manner to push the boundaries of scientific discovery. This paper addresses that problem. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, and we show that scaling evaluation-driven discovery loops along the right dimensions unlocks substantial gains. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. In particular, we speed up the widely used LASSO algorithm by over 2x, design quantum circuit routing policies that reduce gate overhead by 24.5%, and discover new Erdős minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only become more efficient on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective scaling of evaluation-driven loops as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.
