PlainQAFact: Automatic Factuality Evaluation Metric for Biomedical Plain Language Summaries Generation

Abstract

Hallucinated outputs from language models pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing factuality evaluation methods, such as entailment-based and question-answering-based (QA) approaches, struggle with plain language summary (PLS) generation because of the elaborative explanation phenomenon, in which a summary introduces external content (e.g., definitions, background, examples) absent from the source document to enhance comprehension. To address this, we introduce PlainQAFact, a framework trained on PlainFact, a fine-grained, human-annotated dataset, to evaluate the factuality of both source-simplified and elaboratively explained sentences. PlainQAFact first classifies the factuality type and then assesses factuality using a retrieval-augmented QA-based scoring method. Our approach is lightweight and computationally efficient. Empirical results show that existing factuality metrics fail to effectively evaluate factuality in PLS, especially for elaborative explanations, whereas PlainQAFact achieves state-of-the-art performance. We further analyze its effectiveness across external knowledge sources, answer extraction strategies, overlap measures, and document granularity levels, refining its overall factuality assessment.
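The two-stage flow described in the abstract (classify each sentence, then score it with retrieval-augmented QA) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the "simplified"/"elaborative" labels, the choice to add retrieved passages only for elaborative sentences, and the use of token-level F1 as the overlap measure are all hypothetical. The classifier, retriever, question generator, and answerer are passed in as plain callables so that no particular model or library is implied.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between two answers (an assumed overlap measure)."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_summary(
    sentences: List[str],
    source: str,
    classify: Callable[[str, str], str],                  # hypothetical: returns "simplified" or "elaborative"
    retrieve: Callable[[str], List[str]],                 # hypothetical: external passages for a sentence
    make_qa_pairs: Callable[[str], List[Tuple[str, str]]],  # hypothetical: (question, expected answer) pairs
    answer: Callable[[str, str], str],                    # hypothetical: answers a question from a context
) -> float:
    """Average per-sentence QA agreement over the whole summary."""
    sentence_scores = []
    for sentence in sentences:
        context = source
        # Assumption: only elaborative sentences need external knowledge.
        if classify(sentence, source) == "elaborative":
            context = source + " " + " ".join(retrieve(sentence))
        qa_pairs = make_qa_pairs(sentence)
        if not qa_pairs:
            continue  # nothing checkable in this sentence
        pair_scores = [token_f1(answer(q, context), a) for q, a in qa_pairs]
        sentence_scores.append(sum(pair_scores) / len(pair_scores))
    if not sentence_scores:
        return 0.0
    return sum(sentence_scores) / len(sentence_scores)
```

Keeping the four components as injected callables makes it possible to swap knowledge sources, answer extraction strategies, and overlap measures independently, which mirrors the axes the abstract says were analyzed.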

