Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

May 2, 2026
Authors: Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye
cs.AI

Abstract

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) Coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) Visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that a fine-tuned Qwen3-VL-8B-Instruct model achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
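
As a rough illustration of the control flow the abstract describes (iteratively retrieve page screenshots, then let a VLM either ground the next bounding-box evidence step or emit the final answer), here is a minimal Python sketch. Every interface in it (`retrieve_screenshots`, `vlm_ground`, the `Evidence` fields) is a hypothetical stand-in, not the paper's actual API; the real prompts, retriever, and output schema are not specified in this abstract.

```python
# Minimal sketch of a Chain-of-Evidence style loop. All names and signatures
# here are hypothetical placeholders, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Evidence:
    page_id: str                     # which retrieved screenshot the box lives on
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates
    rationale: str                   # the reasoning step tied to this region

def retrieve_screenshots(query: str, k: int = 5) -> list[str]:
    """Hypothetical retriever hook. CoE is retriever-agnostic: any retriever
    fits, as long as it returns page screenshots (no format-specific parsing)."""
    return [f"corpus/page_{i}.png" for i in range(k)]  # placeholder paths

def vlm_ground(question: str, chain: list[Evidence], screenshots: list[str]):
    """Hypothetical call to a vision-language model (e.g., the paper's
    fine-tuned Qwen3-VL-8B-Instruct). A real implementation would prompt the
    model with the screenshots plus the evidence so far and parse a structured
    response; this stub terminates immediately with a dummy answer."""
    return None, "dummy answer"  # (next Evidence or None, final answer or None)

def chain_of_evidence(question: str, max_hops: int = 4):
    """Iterate retrieval and grounding until the VLM answers or the hop
    budget runs out; return the answer with its pixel-level evidence chain."""
    chain: list[Evidence] = []
    query = question
    for _ in range(max_hops):
        shots = retrieve_screenshots(query)
        evidence, answer = vlm_ground(question, chain, shots)
        if answer is not None:
            return answer, chain  # bounding boxes visualize the reasoning chain
        chain.append(evidence)
        query = evidence.rationale  # derive the next-hop query from the new step
    return None, chain  # hop budget exhausted

print(chain_of_evidence("Who directed the film whose poster is shown?"))
```

The design choice the abstract emphasizes is that the retriever and the grounding model are decoupled: the loop above only assumes screenshots in and (bounding box, rationale) pairs out, which is what makes the framework format-agnostic.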