Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

Abstract

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where users must manually locate evidence within lengthy documents based on vague text-level citations; and (2) visual semantic loss, where converting visually rich documents (e.g., slides, PDFs with charts) into text discards the spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that a fine-tuned Qwen3-VL-8B-Instruct model achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
