MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

Abstract

Hypothesis ranking is a crucial component of automated scientific discovery, particularly in the natural sciences, where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on the internal reasoning of large language models without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging because repeatedly conducting real experiments is impractical in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions that models hypothesis performance as a function of similarity to a known ground-truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.
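To make the idea concrete, the following is a minimal Python sketch of a similarity-plus-noise simulator and a feedback-driven ranking loop. It is not the paper's implementation: the function names, the Gaussian noise model, the clipping to [0, 1], the precomputed cluster labels, and the "mean observed score per cluster" priority rule are all assumptions made for illustration, and the paper's three domain-informed assumptions are not reproduced here.

```python
import random


def simulate_performance(similarity, noise_std=0.05, rng=None):
    """Hypothetical simulator: similarity to the ground truth plus Gaussian noise.

    The result is clipped to [0, 1]. The noise model is an assumption of this sketch.
    """
    if rng is None:
        rng = random.Random(0)
    noisy = similarity + rng.gauss(0.0, noise_std)
    return min(1.0, max(0.0, noisy))


def experiment_guided_order(candidates, budget, noise_std=0.05, seed=0):
    """Pick hypotheses to test one at a time, guided by simulated feedback.

    candidates maps a hypothesis name to (cluster_label, similarity).
    The similarity is hidden from the ranking rule and used only by the simulator.
    Returns the list of (name, simulated_score) in the order tested.
    """
    rng = random.Random(seed)
    untested = dict(candidates)
    cluster_scores = {}  # cluster label -> simulated scores observed so far
    tested = []

    def priority(name):
        scores = cluster_scores.get(untested[name][0])
        # Clusters with no feedback yet get an optimistic score of 1.0,
        # so every cluster is sampled at least once before exploitation.
        return sum(scores) / len(scores) if scores else 1.0

    while untested and len(tested) < budget:
        # sorted() makes tie-breaking deterministic (alphabetical by name).
        name = max(sorted(untested), key=priority)
        cluster, similarity = untested.pop(name)
        score = simulate_performance(similarity, noise_std, rng)
        cluster_scores.setdefault(cluster, []).append(score)
        tested.append((name, score))
    return tested


if __name__ == "__main__":
    demo = {
        "a1": ("A", 0.9), "a2": ("A", 0.8),
        "b1": ("B", 0.2), "b2": ("B", 0.1),
    }
    for name, score in experiment_guided_order(demo, budget=3):
        print(name, round(score, 3))
```

In this toy setup the loop first samples one hypothesis from each cluster, then keeps drawing from whichever cluster has produced the better simulated results, which is the basic intuition behind letting feedback from tested hypotheses reorder the untested ones.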