Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models

Abstract

Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful biomedical hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.
