Towards Diverse Scientific Hypothesis Search with Large Language Models
June 9, 2026
Authors: Haorui Wang, Parshin Shojaee, Kazem Meidani, Kunyang Sun, José Miguel Hernández-Lobato, Teresa Head-Gordon, Jiajun He, Chandan K. Reddy, Chao Zhang, Yuanqi Du
cs.AI
Abstract
Large language models (LLMs) are increasingly used to accelerate scientific discovery, most recently in advanced tasks such as generating valid scientific hypotheses. In many discovery settings, however, the goal is not to identify a single best hypothesis: validation can be noisy and expensive, so scientists benefit from a set of high-quality alternative hypotheses that hedge against downstream uncertainty. Commonly used evolutionary search recipes nevertheless tend to prioritize optimization over exploration during hypothesis generation, and the resulting selection pressure leads to diversity collapse. Motivated by these limitations, we formulate hypothesis search as a sampling problem, where the objective is to efficiently produce diverse, high-quality hypotheses under a fixed validation budget. Building on this perspective, we propose \ours, an evolutionary framework inspired by the classical parallel tempering algorithm. It searches for hypotheses at multiple temperature levels and enables principled information exchange across temperatures to improve exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and it produces candidates that remain robust under more expensive downstream computational validation.
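
For readers unfamiliar with the classical parallel tempering algorithm that the abstract cites as inspiration, the following is a minimal sketch of the textbook method on a toy one-dimensional problem. It is not the authors' framework: the double-well `energy` function, the temperature ladder, the Gaussian proposal, and every function and parameter name here are assumptions chosen for illustration, and no LLM or hypothesis validation is involved. The sketch only shows the two ingredients the abstract refers to: independent searches at several temperatures, and a principled (Metropolis) exchange between adjacent temperatures.

```python
import math
import random


def energy(x: float) -> float:
    """Toy double-well energy with minima at x = -1 and x = +1 (illustrative assumption)."""
    return (x * x - 1.0) ** 2


def swap_probability(e_i: float, e_j: float, t_i: float, t_j: float) -> float:
    """Standard replica-exchange acceptance: min(1, exp((1/t_i - 1/t_j) * (e_i - e_j)))."""
    log_ratio = (1.0 / t_i - 1.0 / t_j) * (e_i - e_j)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def parallel_tempering(temps, n_steps=2000, step_size=0.5, seed=0):
    """Run one replica per temperature and return the states visited by the coldest one."""
    rng = random.Random(seed)
    states = [rng.uniform(-2.0, 2.0) for _ in temps]
    cold_samples = []
    for _ in range(n_steps):
        # Local Metropolis move within each temperature level.
        for k, t in enumerate(temps):
            proposal = states[k] + rng.gauss(0.0, step_size)
            delta = energy(proposal) - energy(states[k])
            if delta <= 0.0 or rng.random() < math.exp(-delta / t):
                states[k] = proposal
        # Propose exchanging the states of one randomly chosen adjacent pair.
        k = rng.randrange(len(temps) - 1)
        p = swap_probability(energy(states[k]), energy(states[k + 1]),
                             temps[k], temps[k + 1])
        if rng.random() < p:
            states[k], states[k + 1] = states[k + 1], states[k]
        cold_samples.append(states[0])
    return cold_samples


if __name__ == "__main__":
    samples = parallel_tempering([0.1, 0.5, 2.0])
    left = sum(1 for x in samples if x < 0.0)
    print(f"cold-replica samples in left well: {left} of {len(samples)}")
```

In this toy setting, the hot replicas cross the barrier between the two wells more easily, and the exchange step lets the cold replica inherit those states while each level still targets its own tempered distribution. That is the general intuition behind improving exploration without disrupting convergence; how the paper adapts it to LLM-driven hypothesis search is not specified in the abstract.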