MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Abstract

Document Question Answering (DocQA) is a very common task. Existing methods that combine Large Language Models (LLMs) or Large Vision Language Models (LVLMs) with Retrieval Augmented Generation (RAG) often prioritize information from a single modality and fail to integrate textual and visual cues effectively. These approaches struggle with complex multi-modal reasoning, which limits their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and images. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent, and a summarizing agent. These agents engage in multi-modal context retrieval and combine their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, including MMLongBench and LongDocURL, demonstrate the effectiveness of MDocAgent, which achieves an average improvement of 12.1% over current state-of-the-art methods. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at https://github.com/aiming-lab/MDocAgent.
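
The five-agent collaboration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract names the agents but does not specify how they are wired together, so the call order, the `Agent` signature, the `Pipeline` class, the context keys, and the stub agents below are all hypothetical and chosen only for illustration. In a real system each agent would wrap an LLM or LVLM call, and the text and image contexts would come from retrievers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical agent signature: (question, named context lists) -> free-text output.
Agent = Callable[[str, Dict[str, List[str]]], str]


@dataclass
class Pipeline:
    """Illustrative wiring of the five agents named in the abstract."""
    general: Agent
    critical: Agent
    text: Agent
    image: Agent
    summarizing: Agent

    def answer(self, question: str, text_ctx: List[str], image_ctx: List[str]) -> str:
        both = {"text": text_ctx, "image": image_ctx}
        # Assumed step 1: the general agent drafts an answer from both modalities.
        draft = self.general(question, both)
        # Assumed step 2: the critical agent picks out key clues, given the draft.
        clues = self.critical(question, {**both, "draft": [draft]})
        # Assumed step 3: modality-specific agents each inspect their own context.
        text_view = self.text(question, {"text": text_ctx, "clues": [clues]})
        image_view = self.image(question, {"image": image_ctx, "clues": [clues]})
        # Assumed step 4: the summarizing agent merges all intermediate outputs.
        merged = {"draft": [draft], "text_view": [text_view], "image_view": [image_view]}
        return self.summarizing(question, merged)


def make_stub(name: str) -> Agent:
    """Stand-in for a model call: reports which context keys the agent received."""
    def agent(question: str, context: Dict[str, List[str]]) -> str:
        return f"{name}({','.join(sorted(context))})"
    return agent


names = ["general", "critical", "text", "image", "summarizing"]
pipeline = Pipeline(*(make_stub(n) for n in names))
result = pipeline.answer("What is the total?", ["page 3 text"], ["page_3.png"])
print(result)
```

The stubs only echo the names of the inputs they receive, which makes the data flow visible without calling any model. Replacing `make_stub` with functions that query real models would keep the same structure.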
