MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

Abstract

Hypothesis ranking is a crucial component of automated scientific discovery, particularly in the natural sciences, where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking: they rely solely on the internal reasoning of large language models and do not incorporate empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. Developing such strategies is challenging, however, because repeatedly conducting real experiments is impractical in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions that models hypothesis performance as a function of similarity to a known ground-truth hypothesis, perturbed by noise. To validate the simulator, we curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.
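As a rough illustration of the two ideas above (a similarity-based noisy simulator and a ranking loop that reacts to feedback), the following Python sketch may help. It is not the authors' implementation: the abstract does not state the three assumptions, so the sketch assumes only that the simulated score equals the similarity to the ground truth plus Gaussian noise, clipped to [0, 1]. The function names, the `prior` field standing in for a pre-experiment score, the precomputed `cluster` labels, and the cluster-mean update rule are all hypothetical.

```python
import random


def simulate_experiment(similarity, noise_std=0.05, rng=None):
    """Hypothetical simulator: similarity to the ground truth plus
    Gaussian noise, clipped to [0, 1]. The noise model is an assumption."""
    rng = rng or random.Random()
    noisy = similarity + rng.gauss(0.0, noise_std)
    return min(1.0, max(0.0, noisy))


def experiment_guided_ranking(candidates, budget, noise_std=0.05, seed=0):
    """Test up to `budget` hypotheses, re-prioritizing after each result.

    Each candidate is a dict with:
      name       - identifier
      cluster    - label of its functional cluster (assumed precomputed)
      prior      - pre-experiment score (assumed, e.g. from an LLM)
      similarity - hidden similarity to the ground truth, used only
                   by the simulator
    Returns the tested hypotheses as (name, simulated score), in order.
    """
    rng = random.Random(seed)
    untested = list(candidates)
    cluster_scores = {}  # cluster label -> list of observed scores
    tested = []

    def priority(h):
        # Use the mean feedback of the cluster once it has been probed;
        # otherwise fall back to the pre-experiment prior.
        observed = cluster_scores.get(h["cluster"])
        return sum(observed) / len(observed) if observed else h["prior"]

    for _ in range(min(budget, len(untested))):
        best = max(untested, key=priority)
        untested.remove(best)
        score = simulate_experiment(best["similarity"], noise_std, rng)
        cluster_scores.setdefault(best["cluster"], []).append(score)
        tested.append((best["name"], score))
    return tested


if __name__ == "__main__":
    demo = [
        {"name": "A", "cluster": "x", "prior": 0.9, "similarity": 0.2},
        {"name": "B", "cluster": "x", "prior": 0.8, "similarity": 0.3},
        {"name": "C", "cluster": "y", "prior": 0.5, "similarity": 0.9},
        {"name": "D", "cluster": "y", "prior": 0.4, "similarity": 0.8},
    ]
    for name, score in experiment_guided_ranking(demo, budget=4):
        print(name, round(score, 3))
```

In this toy setup, a pure pre-experiment ranking would test A, then B, then C, then D. With feedback, a poor result for A lowers the priority of its cluster-mate B, so the loop moves to cluster "y" sooner. This is the intuition behind clustering by shared functional characteristics, shown here only under the stated assumptions.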