Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
May 27, 2026
Authors: Yubo Li, Rema Padman, Ramayya Krishnan
cs.AI
Abstract
A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves, a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.
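To illustrate what shifting the unit of evaluation to the inter-source relationship might look like in practice, here is a minimal sketch. It is not the paper's HERO-QA pipeline or its judge: the function names, the `"agree"`/`"differ"` labels, and the string-matching `toy_judge` are all hypothetical stand-ins (the abstract does not name the five taxonomy labels). The sketch assumes one answer has already been generated per source for each question, labels every pair of sources, and reports how many questions show any non-agreeing pair.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a judge maps two answers to a relationship label.
# The real judge would use the paper's 5-label taxonomy, not named here.
Judge = Callable[[str, str], str]
Audit = Dict[Tuple[str, str], str]


def audit_question(answers_by_source: Dict[str, str], judge: Judge) -> Audit:
    """Label every unordered pair of sources for a single question."""
    return {
        (a, b): judge(answers_by_source[a], answers_by_source[b])
        for a, b in combinations(sorted(answers_by_source), 2)
    }


def disagreement_prevalence(audits: List[Audit], agree_label: str = "agree") -> float:
    """Fraction of questions with at least one non-agreeing source pair."""
    if not audits:
        return 0.0
    flagged = sum(
        any(label != agree_label for label in audit.values()) for audit in audits
    )
    return flagged / len(audits)


def toy_judge(x: str, y: str) -> str:
    """Placeholder judge (assumption): case-insensitive exact match only."""
    return "agree" if x.strip().lower() == y.strip().lower() else "differ"
```

Called with `audit_question({"handbook_a": "answer X", "handbook_b": "answer Y"}, toy_judge)`, the sketch returns one label per source pair. The point of the design is that no gold answer appears anywhere: the score describes how sources relate to each other, not whether any one of them is correct.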