Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

Abstract

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) Coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) Visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
