ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

February 25, 2025
Authors: Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao
cs.AI

Abstract

Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.
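The abstract names a GMM-based hybrid strategy for multi-modal retrieval but does not spell out how it works. The following is a minimal illustrative sketch under stated assumptions, not the authors' implementation: it assumes each retriever (textual and visual) yields one similarity score per page, fits a two-component 1-D Gaussian mixture to those scores, keeps the pages assigned to the higher-mean component, and takes the union across the two retrievers. The function names (`fit_two_gaussians`, `select_relevant`, `hybrid_retrieve`) and the example scores are hypothetical.

```python
import math


def fit_two_gaussians(scores, iters=50):
    """Fit a 1-D two-component Gaussian mixture to `scores` with plain EM."""
    lo, hi = min(scores), max(scores)
    mu = [lo, hi]                                  # start the means at the extremes
    var = [max((hi - lo) ** 2 / 4, 1e-6)] * 2      # shared initial variance, floored
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each score
        resp = []
        for s in scores:
            p = [
                weight[k]
                * math.exp(-((s - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = sum(p) or 1e-300               # guard against underflow to zero
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            mu[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var[k] = max(
                sum(r[k] * (s - mu[k]) ** 2 for r, s in zip(resp, scores)) / nk,
                1e-6,
            )
            weight[k] = nk / len(scores)
    return mu, resp


def select_relevant(scored_pages):
    """Keep pages that the mixture assigns to the higher-mean component.

    `scored_pages` maps a page id to a similarity score (assumed input shape).
    """
    ids = list(scored_pages)
    mu, resp = fit_two_gaussians([scored_pages[i] for i in ids])
    high = 0 if mu[0] > mu[1] else 1
    return {i for i, r in zip(ids, resp) if r[high] > 0.5}


def hybrid_retrieve(text_scores, visual_scores):
    """Union of the pages selected independently from each modality."""
    return select_relevant(text_scores) | select_relevant(visual_scores)


# Hypothetical similarity scores for five pages from two retrievers.
text_scores = {"p1": 0.91, "p2": 0.88, "p3": 0.15, "p4": 0.12, "p5": 0.10}
visual_scores = {"p1": 0.80, "p2": 0.20, "p3": 0.78, "p4": 0.22, "p5": 0.18}
selected = hybrid_retrieve(text_scores, visual_scores)
```

In this sketch the number of retrieved pages is not fixed in advance; it follows from where the score distribution separates, which is one plausible motivation for using a mixture model instead of a fixed top-k cutoff.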
