Towards Diverse Scientific Hypothesis Search with Large Language Models

June 9, 2026
Authors: Haorui Wang, Parshin Shojaee, Kazem Meidani, Kunyang Sun, José Miguel Hernández-Lobato, Teresa Head-Gordon, Jiajun He, Chandan K. Reddy, Chao Zhang, Yuanqi Du
cs.AI

Abstract

Large language models (LLMs) are increasingly used to accelerate scientific discovery, most recently in advanced tasks such as generating valid scientific hypotheses. Yet in many discovery settings the goal is not to identify a single best hypothesis: validation can be noisy and expensive, so scientists benefit from a set of high-quality alternative hypotheses that hedges against downstream uncertainty about which solution is best. Commonly used evolutionary search recipes, however, tend to prioritize optimization over exploration during hypothesis generation, and the resulting selection pressure leads to diversity collapse. Motivated by this limitation, we formulate hypothesis search as a sampling problem in which the objective is to efficiently produce diverse, high-quality hypotheses under a fixed validation budget. Building on this perspective, we propose \ours, an evolutionary framework inspired by the classical parallel tempering algorithm that searches for hypotheses at multiple temperature levels and enables principled information exchange across temperatures to improve exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and it produces candidates that remain robust under more expensive downstream computational validation.
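
For readers unfamiliar with the classical parallel tempering algorithm that the abstract cites as inspiration, the following is a minimal sketch of the textbook version on a toy one-dimensional problem. It is not the authors' framework. The double-well `energy` function, the temperature ladder, and all function names are assumptions made up for illustration. In a hypothesis-search setting the energy would presumably correspond to a negated quality score, but that mapping is also an assumption here.

```python
import math
import random


def energy(x):
    # Hypothetical toy objective: a double well with minima at x = -1 and x = +1.
    return (x * x - 1.0) ** 2


def swap_accept_prob(energy_i, energy_j, temp_i, temp_j):
    # Standard replica-exchange rule: min(1, exp((1/T_i - 1/T_j) * (E_i - E_j))).
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def parallel_tempering(temps, n_steps, step_size=0.5, seed=0):
    # temps is ordered from coldest (most exploitative) to hottest (most exploratory).
    rng = random.Random(seed)
    states = [rng.uniform(-2.0, 2.0) for _ in temps]
    cold_samples = []
    for _ in range(n_steps):
        # Local Metropolis move inside each replica at its own temperature.
        for k, temp in enumerate(temps):
            proposal = states[k] + rng.gauss(0.0, step_size)
            delta = energy(proposal) - energy(states[k])
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                states[k] = proposal
        # Information exchange: try to swap one random pair of adjacent replicas.
        k = rng.randrange(len(temps) - 1)
        p = swap_accept_prob(
            energy(states[k]), energy(states[k + 1]), temps[k], temps[k + 1]
        )
        if rng.random() < p:
            states[k], states[k + 1] = states[k + 1], states[k]
        cold_samples.append(states[0])
    return cold_samples


samples = parallel_tempering(temps=[0.05, 0.2, 1.0], n_steps=2000)
```

The hot replicas cross the barrier between the two wells easily, and accepted swaps pass those states down to the cold replica, which on its own would tend to stay in one well. This is the sense in which exchange across temperatures can help exploration while the cold chain still targets low-energy states.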