CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

May 13, 2026
Authors: Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can reach the correct answer while grounding it in the wrong passage. That is a serious risk in high-stakes domains such as law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer and evaluates the two jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, with documents averaging 40.6 pages. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, CiteVQA exposes a reliability gap that answer-only evaluations overlook and provides the instrumentation needed to close it, a step towards trustworthy document intelligence. Our repository is available at https://github.com/opendatalab/CiteVQA.
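
To make the joint scoring rule concrete, the following is a minimal sketch of how a metric in the spirit of SAA could be computed. The abstract only states that a prediction is credited when both the answer and the cited region are correct; everything else here is an assumption for illustration, including the `Prediction` structure, the IoU-based region match, the 0.5 threshold, and the rule that every gold region must be covered. It is not the benchmark's actual implementation.

```python
from dataclasses import dataclass

# (x0, y0, x1, y1) in page coordinates; the format is an assumption.
Box = tuple[float, float, float, float]


@dataclass
class Prediction:
    """Hypothetical record for one question; not the CiteVQA data schema."""
    answer_correct: bool            # result of some answer-matching step
    cited: list[tuple[int, Box]]    # (page index, box) returned by the model
    gold: list[tuple[int, Box]]     # (page index, box) ground-truth evidence


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def citation_correct(p: Prediction, threshold: float = 0.5) -> bool:
    """Assumed rule: each gold region is matched by a cited box on the same page."""
    return bool(p.gold) and all(
        any(page == gold_page and iou(box, gold_box) >= threshold
            for page, box in p.cited)
        for gold_page, gold_box in p.gold
    )


def answer_accuracy(preds: list[Prediction]) -> float:
    """Answer-only score on a 0-100 scale, ignoring citations."""
    if not preds:
        return 0.0
    return 100.0 * sum(p.answer_correct for p in preds) / len(preds)


def strict_attributed_accuracy(preds: list[Prediction], threshold: float = 0.5) -> float:
    """Credit a prediction only when answer and citation are both correct."""
    if not preds:
        return 0.0
    hits = sum(p.answer_correct and citation_correct(p, threshold) for p in preds)
    return 100.0 * hits / len(preds)
```

Under these assumptions, a prediction with a correct answer but a citation on the wrong region raises `answer_accuracy` without raising `strict_attributed_accuracy`, which is the gap the abstract calls Attribution Hallucination.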