

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

October 14, 2024
作者: Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded as an image using a VLM and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25-39% end-to-end performance gain over the traditional text-based RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.
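The abstract describes a retrieve-then-generate flow in which document pages are embedded directly as images, skipping text parsing entirely. Below is a minimal sketch of that flow, under stated assumptions: the helper callables `embed_text`, `embed_image`, and `generate_answer` are hypothetical stand-ins for the paper's VLM-based retriever and generator (the actual models and training code live in the linked repository), not the authors' API.

```python
# Minimal VisRAG-style sketch. embed_text, embed_image, and
# generate_answer are HYPOTHETICAL stand-ins for the paper's
# VLM-based retriever and generator, not a real library API.
from typing import Callable, List

import numpy as np


def retrieve(
    query: str,
    page_images: List[object],          # e.g. PIL images of document pages
    embed_text: Callable[[str], np.ndarray],
    embed_image: Callable[[object], np.ndarray],
    k: int = 3,
) -> List[int]:
    """Return indices of the top-k pages by cosine similarity.

    Pages are embedded directly as images, so layout and figures
    are never lost to a text-parsing step.
    """
    q = embed_text(query)
    q = q / np.linalg.norm(q)
    doc = np.stack([embed_image(img) for img in page_images])
    doc = doc / np.linalg.norm(doc, axis=1, keepdims=True)
    scores = doc @ q                     # cosine similarity per page
    return np.argsort(-scores)[:k].tolist()


def visrag_answer(query, page_images, embed_text, embed_image,
                  generate_answer, k=3):
    """Retrieve page images, then condition a generative VLM on them."""
    top = retrieve(query, page_images, embed_text, embed_image, k)
    # The generator consumes retrieved pages as raw images plus the
    # question, rather than parsed text as in text-based RAG.
    return generate_answer(query, [page_images[i] for i in top])
```

The key design point the paper argues for is visible in `retrieve`: because similarity is computed over image embeddings of whole pages, no information is discarded by an OCR or layout-parsing stage before retrieval.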
