Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Abstract

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give different answers to the same question depending on which source it retrieves, a failure mode that the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, and release three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.