MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search
May 25, 2025
Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Yujie Liu, Wei Li, Tong Xie, Lidong Bing, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
cs.AI
Abstract
Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.
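To make the reward-landscape framing concrete, here is a minimal Python sketch of how an ensemble of LLM judges could assign a scalar reward to one candidate hypothesis. The `score_hypothesis` helper, the `judges` callables, and the 1-10 rating prompt are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the "latent reward landscape" idea: one or more
# LLM judges score a candidate hypothesis, and the (averaged) scores
# define a reward surface over the hypothesis space.
from statistics import mean
from typing import Callable, List


def score_hypothesis(
    hypothesis: str,
    research_direction: str,
    judges: List[Callable[[str], str]],
) -> float:
    """Average the scores that several LLM judges assign to one hypothesis."""
    prompt = (
        f"Research direction: {research_direction}\n"
        f"Candidate hypothesis: {hypothesis}\n"
        "Rate how promising and experimentally actionable this hypothesis is "
        "on a scale of 1-10. Reply with a single number."
    )
    scores = []
    for judge in judges:
        reply = judge(prompt)  # each judge is one call to one LLM instance
        try:
            scores.append(float(reply.strip()))
        except ValueError:
            continue  # skip malformed judge outputs
    return mean(scores) if scores else 0.0
```

In this framing, passing repeated instances of the same model as `judges` corresponds to question (4), while mixing different models of similar capacity corresponds to question (3).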
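The hierarchical search itself could then be sketched as a level-by-level refinement loop over that reward, as below; the level names and the `propose_refinements` helper are hypothetical stand-ins for the paper's prompting scheme.

```python
# A minimal greedy sketch of the hierarchical search: at each level of
# detail the model proposes refinements of the current hypothesis, the
# reward function picks the best one, and the search descends to the
# next, more specific level.
from typing import Callable, List

LEVELS = [
    "general mechanism or concept",
    "key methodological choices (materials, reagents, methods)",
    "concrete experimental configuration (conditions, parameters)",
]


def hierarchical_search(
    coarse_direction: str,
    propose_refinements: Callable[[str, str], List[str]],
    reward: Callable[[str], float],
) -> str:
    """Greedily refine a coarse research direction level by level."""
    hypothesis = coarse_direction
    for level in LEVELS:
        candidates = propose_refinements(hypothesis, level)
        if not candidates:
            continue
        # Keep the candidate the reward landscape rates highest at this level.
        hypothesis = max(candidates, key=reward)
    return hypothesis
```

One plausible reading of the smoothing claim is that this structure only ever compares hypotheses at the same level of granularity, so each local decision faces a flatter, better-behaved slice of the reward landscape than a single flat search over fully specified hypotheses would.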