CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
May 13, 2026
Authors: Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage. That is especially dangerous in high-stakes domains such as law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer and evaluates the two jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. As a step toward trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook and provides the instrumentation needed to close it. Our repository is available at https://github.com/opendatalab/CiteVQA.
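The joint scoring rule behind SAA can be illustrated with a minimal sketch. The abstract states only that a prediction is credited when the answer and the cited region are both correct; it does not specify how a citation is matched to the ground truth. The `Box` type, the IoU threshold of 0.5, and the two-way matching rule in `citation_correct` are therefore assumptions made for illustration, not the benchmark's actual implementation. Answer correctness is taken as a given boolean, since the abstract does not say how answers are judged.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical element-level citation: a page index plus a rectangle."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union; boxes on different pages never overlap."""
    if a.page != b.page:
        return 0.0
    iw = min(a.x1, b.x1) - max(a.x0, b.x0)
    ih = min(a.y1, b.y1) - max(a.y0, b.y0)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    return inter / (area_a + area_b - inter)


def citation_correct(pred: List[Box], gold: List[Box], thr: float = 0.5) -> bool:
    """ASSUMED rule: every gold box is matched by some predicted box,
    and every predicted box matches some gold box, at IoU >= thr."""
    if not pred or not gold:
        return False
    covers = all(any(iou(p, g) >= thr for p in pred) for g in gold)
    precise = all(any(iou(p, g) >= thr for g in gold) for p in pred)
    return covers and precise


Record = Tuple[bool, List[Box], List[Box]]  # (answer_correct, predicted, gold)


def strict_attributed_accuracy(records: List[Record], thr: float = 0.5) -> float:
    """Percentage of records where the answer AND the citation are both correct."""
    if not records:
        return 0.0
    hits = sum(1 for ok, pred, gold in records
               if ok and citation_correct(pred, gold, thr))
    return 100.0 * hits / len(records)


def answer_accuracy(records: List[Record]) -> float:
    """Answer-only accuracy, for contrast with the strict joint score."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for ok, _, _ in records if ok) / len(records)


gold = [Box(3, 0, 0, 10, 10)]
good = [Box(3, 0, 0, 10, 9)]   # same page, heavy overlap with the gold box
bad = [Box(5, 0, 0, 10, 10)]   # right shape, wrong page
records = [
    (True, good, gold),    # right answer, right region: counts toward SAA
    (True, bad, gold),     # right answer, wrong region: attribution hallucination
    (False, good, gold),   # wrong answer, right region
    (False, bad, gold),    # wrong on both
]
print(answer_accuracy(records), strict_attributed_accuracy(records))
```

In this toy set, the second record is the failure mode the abstract describes: it raises answer-only accuracy but earns no credit under the strict joint score.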