PlainQAFact: Automatic Factuality Evaluation Metric for Biomedical Plain Language Summaries Generation
March 11, 2025
Authors: Zhiwen You, Yue Guo
cs.AI
Abstract
Hallucinated outputs from language models pose risks in the medical domain,
especially for lay audiences making health-related decisions. Existing
factuality evaluation methods, such as entailment- and question-answering
(QA)-based approaches, struggle with plain language summary (PLS) generation
due to the elaborative explanation phenomenon, which introduces external
content (e.g., definitions, background, examples) absent from the source
document to enhance comprehension. To address this, we introduce PlainQAFact,
a framework trained on a fine-grained, human-annotated dataset, PlainFact, to
evaluate the factuality of both source-simplified and elaboratively explained
sentences. PlainQAFact first classifies the factuality type and then assesses
factuality using a retrieval-augmented QA-based scoring method. Our approach
is lightweight and computationally efficient. Empirical results show that
existing factuality metrics fail to effectively evaluate factuality in PLS,
especially for elaborative explanations, whereas PlainQAFact achieves
state-of-the-art performance. We further analyze its effectiveness across
external knowledge sources, answer extraction strategies, overlap measures,
and document granularity levels, refining its overall factuality assessment.
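
The abstract outlines a two-stage pipeline: classify each summary sentence's factuality type, then score it via retrieval-augmented QA. The sketch below shows how such a pipeline could be wired together; it is a minimal illustration based only on the abstract, and all component names (`classify`, `retrieve`, `gen_qa_pairs`, `answer`, `overlap`) are hypothetical placeholders, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ScoredSentence:
    sentence: str
    factuality_type: str  # e.g., "simplification" vs. "elaboration" (assumed labels)
    score: float          # mean QA answer-overlap, in [0, 1]

def qa_factuality_scores(
    summary_sentences: List[str],
    source_doc: str,
    classify: Callable[[str, str], str],                   # (sentence, source) -> factuality type
    retrieve: Callable[[str], str],                        # sentence -> external knowledge passage
    gen_qa_pairs: Callable[[str], List[Tuple[str, str]]],  # sentence -> (question, answer) pairs
    answer: Callable[[str, str], str],                     # (question, context) -> predicted answer
    overlap: Callable[[str, str], float],                  # (gold, predicted) -> similarity in [0, 1]
) -> List[ScoredSentence]:
    """Score each summary sentence by answer overlap against its evidence context."""
    results: List[ScoredSentence] = []
    for sent in summary_sentences:
        ftype = classify(sent, source_doc)
        # Elaborative explanations add content absent from the source, so the
        # evidence context is augmented with retrieved external knowledge.
        if ftype == "elaboration":
            context = source_doc + "\n" + retrieve(sent)
        else:
            context = source_doc
        qa_pairs = gen_qa_pairs(sent)
        if not qa_pairs:
            results.append(ScoredSentence(sent, ftype, 0.0))
            continue
        mean_overlap = sum(
            overlap(gold, answer(q, context)) for q, gold in qa_pairs
        ) / len(qa_pairs)
        results.append(ScoredSentence(sent, ftype, mean_overlap))
    return results
```

Under this framing, the ablations mentioned at the end of the abstract map onto concrete knobs: the external knowledge source is the `retrieve` backend, answer extraction strategies live in `answer`, overlap measures in `overlap`, and document granularity determines how much of `source_doc` forms the evidence context.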