MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
March 18, 2025
Authors: Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao
cs.AI
Abstract
Document Question Answering (DocQA) is a very common task. Existing methods
using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and
Retrieval Augmented Generation (RAG) often prioritize information from a single
modality, failing to effectively integrate textual and visual cues. These
approaches struggle with complex multi-modal reasoning, limiting their
performance on real-world documents. We present MDocAgent (A Multi-Modal
Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent
framework that leverages both text and images. Our system employs five
specialized agents: a general agent, a critical agent, a text agent, an image
agent and a summarizing agent. These agents engage in multi-modal context
retrieval, combining their individual insights to achieve a more comprehensive
understanding of the document's content. This collaborative approach enables
the system to synthesize information from both textual and visual components,
leading to improved accuracy in question answering. Preliminary experiments on
five benchmarks, including MMLongBench and LongDocURL, demonstrate the
effectiveness of MDocAgent, achieving an average improvement of 12.1% over the
current state-of-the-art method. This work contributes to the development of more
robust and comprehensive DocQA systems capable of handling the complexities of
real-world documents containing rich textual and visual information. Our data
and code are available at https://github.com/aiming-lab/MDocAgent.
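
The five-agent design described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract names the five agents and says they combine their insights, but it does not specify the order in which they run or what they pass to one another. The call order, the `RetrievedContext` type, and the helper names `answer_question`, `make_stub`, and `join_views` are all hypothetical, and the agents are stubbed out instead of calling a real LLM or LVLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievedContext:
    """Multi-modal context for one question (all fields are illustrative)."""
    question: str
    text_chunks: List[str]  # passages returned by a text retriever
    image_refs: List[str]   # page-image identifiers returned by an image retriever


# Each agent is modelled as a plain callable returning a string. In a real
# system it would wrap an LLM or LVLM call; that part is deliberately omitted.
Agent = Callable[..., str]


def answer_question(
    ctx: RetrievedContext,
    general: Agent,
    critical: Agent,
    text_agent: Agent,
    image_agent: Agent,
    summarizer: Agent,
) -> str:
    # ASSUMPTION: the call order below is hypothetical, not taken from the abstract.
    draft = general(ctx.question, ctx.text_chunks, ctx.image_refs)
    hints = critical(ctx.question, draft)
    text_view = text_agent(ctx.question, ctx.text_chunks, hints)
    image_view = image_agent(ctx.question, ctx.image_refs, hints)
    # The summarizing agent combines the individual insights into one answer.
    return summarizer(ctx.question, [draft, text_view, image_view])


def make_stub(label: str) -> Agent:
    """Placeholder agent that ignores its inputs and returns its label."""
    return lambda *args: label


def join_views(question: str, views: List[str]) -> str:
    """Placeholder summarizer that concatenates the views it receives."""
    return " | ".join(views)


if __name__ == "__main__":
    ctx = RetrievedContext("What is the total revenue?", ["page 3 text"], ["page_3.png"])
    print(answer_question(ctx, make_stub("draft"), make_stub("hints"),
                          make_stub("text view"), make_stub("image view"), join_views))
```

Keeping each agent as an interchangeable callable means the text-only and image-only paths can be swapped or ablated independently, which is one way to check how much each modality contributes to the final answer.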