Same Question, Different Sources, Different Answers: Examining Source-Dependence in Medical Multi-Source RAG

Abstract

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves, a failure mode that the dominant single-gold-answer evaluation paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, and release three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated the prevalence of disagreement, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG. Measuring source-dependence is a responsibility for deployed multi-source NLP generally.