Toward Diverse Scientific Hypothesis Search with Large Language Models

Abstract

Large language models (LLMs) are increasingly used to accelerate scientific discovery, most recently in advanced tasks such as generating valid scientific hypotheses. Yet in many discovery settings the goal is not to identify a single best hypothesis: validation can be noisy and expensive, and scientists benefit from a set of high-quality alternative hypotheses that complement the top-ranked solution and hedge against downstream uncertainty. Nevertheless, commonly used evolutionary search recipes tend to prioritize optimization over exploration in hypothesis generation, and the resulting selection pressure during search leads to diversity collapse. Motivated by these limitations, we formulate hypothesis search as a sampling problem, where the objective is to efficiently produce diverse, high-quality hypotheses under a fixed validation budget. Building on this perspective, we propose \ours, an evolutionary framework inspired by the classical parallel tempering algorithm. It searches for hypotheses at multiple temperature levels and enables principled information exchange across temperatures, improving exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and produces candidates that remain robust under more expensive downstream computational validation.
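
For readers unfamiliar with the classical algorithm named above, the following is a minimal sketch of textbook parallel tempering (replica exchange) on a toy one-dimensional problem. It is not the proposed framework: the function names, the double-well energy, the Gaussian proposal, and the temperature ladder are all illustrative assumptions. In a hypothesis-search setting one might take the energy to be a negative validation score and the proposal to be an LLM-driven mutation, but the abstract does not specify that mapping.

```python
import math
import random


def swap_probability(energy_i, energy_j, temp_i, temp_j):
    """Classical replica-exchange acceptance probability:
    min(1, exp((1/T_i - 1/T_j) * (E_i - E_j)))."""
    log_ratio = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def double_well(x):
    """Toy energy with two minima at x = -1 and x = +1 (illustrative only)."""
    return 4.0 * (x * x - 1.0) ** 2


def parallel_tempering(energy, init, temps, steps, step_size=0.5, seed=0):
    """Run one chain per temperature; return the coldest chain's trajectory.

    `temps` is assumed to be sorted from coldest to hottest.
    """
    rng = random.Random(seed)
    states = [init for _ in temps]
    cold_samples = []
    for _ in range(steps):
        # Within-temperature Metropolis move for every chain.
        for k, temp in enumerate(temps):
            proposal = states[k] + rng.gauss(0.0, step_size)
            delta = energy(proposal) - energy(states[k])
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                states[k] = proposal
        # Propose exchanging the states of one random adjacent pair.
        k = rng.randrange(len(temps) - 1)
        p_swap = swap_probability(
            energy(states[k]), energy(states[k + 1]), temps[k], temps[k + 1]
        )
        if rng.random() < p_swap:
            states[k], states[k + 1] = states[k + 1], states[k]
        cold_samples.append(states[0])
    return cold_samples
```

Calling `parallel_tempering(double_well, -1.0, [0.1, 0.5, 2.0], 2000)` returns the coldest chain's trajectory. Hot chains cross the barrier between the two wells easily, and accepted swaps pass those states down the ladder, which is the exploration mechanism the abstract alludes to. The swap rule preserves each chain's target distribution, so exchange does not disturb convergence of the cold chain.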