Same Question, Different Sources, Different Answers: Auditing Source-Dependence in Medical Multi-Source RAG

Abstract

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give different answers to the same question depending on which source it retrieves, a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, and release three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.