Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models

Abstract

Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast bodies of literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, because verifying their accuracy often requires substantial time and resources. In addition, hallucination in LLMs can produce hypotheses that appear plausible but are ultimately incorrect, which undermines their reliability. To support the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the ability of LLMs to generate truthful biomedical hypotheses, and KnowHD, a knowledge-based hallucination detector that evaluates how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.
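
To make the filtering idea concrete, the following is a minimal illustrative sketch of selecting the most grounded hypothesis from several candidates. It is not the KnowHD implementation: the `Hypothesis` class, the `groundedness` function, the exact-match scoring rule (fraction of reasoning claims found in a knowledge set), and the toy claims are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A candidate hypothesis with the atomic claims made in its reasoning (hypothetical structure)."""
    text: str
    claims: list[str]


def groundedness(hyp: Hypothesis, knowledge: set[str]) -> float:
    """Assumed scoring rule: fraction of claims that appear verbatim in the knowledge set."""
    if not hyp.claims:
        return 0.0
    supported = sum(1 for claim in hyp.claims if claim in knowledge)
    return supported / len(hyp.claims)


def select_most_grounded(candidates: list[Hypothesis], knowledge: set[str]) -> Hypothesis:
    """Return the candidate with the highest groundedness score."""
    return max(candidates, key=lambda h: groundedness(h, knowledge))


# Toy placeholder statements, not real biomedical facts.
knowledge = {
    "gene A regulates protein B",
    "protein B is expressed in tissue C",
}
h1 = Hypothesis(
    text="gene A influences tissue C",
    claims=["gene A regulates protein B", "protein B is expressed in tissue C"],
)
h2 = Hypothesis(
    text="gene A suppresses gene D",
    claims=["gene A regulates protein B", "protein B inhibits gene D"],
)

best = select_most_grounded([h1, h2], knowledge)
print(best.text, groundedness(best, knowledge))
```

In this toy setup every claim behind `h1` is supported, while one of the two claims behind `h2` is not, so `h1` is selected. A real detector would need claim extraction and a softer notion of support than exact string matching.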
