VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
October 14, 2024
Authors: Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Retrieval-augmented generation (RAG) is an effective technique that enables
large language models (LLMs) to utilize external knowledge sources for
generation. However, current RAG systems are solely based on text, rendering it
impossible to utilize vision information like layout and images that play
crucial roles in real-world multi-modality documents. In this paper, we
introduce VisRAG, which tackles this issue by establishing a vision-language
model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the
document to obtain text, the document is directly embedded using a VLM as an
image and then retrieved to enhance the generation of a VLM. Compared to
traditional text-based RAG, VisRAG maximizes the retention and utilization of
the data information in the original documents, eliminating the information
loss introduced during the parsing process. We collect both open-source and
synthetic data to train the retriever in VisRAG and explore a variety of
generation methods. Experiments demonstrate that VisRAG outperforms traditional
RAG in both the retrieval and generation stages, achieving a 25-39%
end-to-end performance gain over traditional text-based RAG pipeline. Further
analysis reveals that VisRAG is effective in utilizing training data and
demonstrates strong generalization capability, positioning it as a promising
solution for RAG on multi-modality documents. Our code and data are available
at https://github.com/openbmb/visrag.
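The pipeline the abstract describes (embed each document page directly as an image with a VLM, retrieve the best-matching pages for a query, then generate an answer conditioned on those pages) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed_page` and `generate` are hypothetical stand-ins for the VLM-based embedder and generator, operating on toy string "pages" instead of real page images.

```python
import math

def embed_page(page):
    # Hypothetical stand-in for the VLM image encoder: in VisRAG the
    # page image (not parsed text) is embedded; here we just map a
    # string to a small normalized vector for demonstration.
    vec = [0.0] * 8
    for i, ch in enumerate(page):
        vec[i % 8] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query, corpus, k=1):
    # Score each page by cosine similarity between the query embedding
    # and the page embedding, and return the top-k pages.
    q = embed_page(query)
    scored = sorted(
        corpus,
        key=lambda page: -sum(a * b for a, b in zip(q, embed_page(page))),
    )
    return scored[:k]

def generate(query, pages):
    # Hypothetical stand-in for the VLM generator, which in VisRAG
    # consumes the retrieved page images together with the query.
    return f"Answer to {query!r} grounded in {len(pages)} retrieved page(s)."

corpus = ["page about solar panels", "page about RAG pipelines"]
top = retrieve("how does retrieval-augmented generation work?", corpus)
print(generate("how does RAG work?", top))
```

The key design point the abstract emphasizes is visible in the structure: no text-parsing step sits between the document and the retriever, so layout and image information that parsing would discard never leaves the pipeline.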