Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models

Abstract

Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze large bodies of literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, because verifying their accuracy often requires substantial time and resources. In addition, hallucination in LLMs can produce hypotheses that appear plausible but are ultimately incorrect, which undermines their reliability. To support the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the ability of LLMs to generate truthful biomedical hypotheses, and KnowHD, a knowledge-based hallucination detector that evaluates how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.
