MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Abstract

Document Question Answering (DocQA) is a common task. Existing methods that use Large Language Models (LLMs) or Large Vision Language Models (LVLMs) together with Retrieval Augmented Generation (RAG) often prioritize information from a single modality and fail to integrate textual and visual cues effectively. As a result, these approaches struggle with complex multi-modal reasoning, which limits their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and images. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent, and a summarizing agent. These agents engage in multi-modal context retrieval and combine their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, including MMLongBench and LongDocURL, demonstrate the effectiveness of MDocAgent, which achieves an average improvement of 12.1% over the current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at https://github.com/aiming-lab/MDocAgent.
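
The abstract names five agents but does not spell out how they exchange information. The following is a minimal Python sketch of one plausible arrangement, not the authors' implementation. The call order, the `RetrievedContext` container, the dictionary keys, and the stub agents are all assumptions made for illustration; the actual code is in the repository linked above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: an agent maps a dict of inputs to a text response.
Agent = Callable[[Dict[str, object]], str]


@dataclass
class RetrievedContext:
    """Output of multi-modal retrieval (both fields are illustrative)."""
    text_chunks: List[str]   # top-ranked text passages
    page_images: List[str]   # identifiers of top-ranked page images


def answer_question(question: str, context: RetrievedContext,
                    agents: Dict[str, Agent]) -> str:
    """Run the five agents in an ASSUMED order and return the final answer."""
    # 1. General agent drafts an answer from both modalities.
    draft = agents["general"]({
        "question": question,
        "text": context.text_chunks,
        "images": context.page_images,
    })
    # 2. Critical agent picks out what the specialists should focus on.
    clues = agents["critical"]({"question": question, "draft": draft})
    # 3. Text and image agents each answer from a single modality.
    text_answer = agents["text"]({
        "question": question, "text": context.text_chunks, "clues": clues})
    image_answer = agents["image"]({
        "question": question, "images": context.page_images, "clues": clues})
    # 4. Summarizing agent combines the individual answers.
    return agents["summarizing"]({
        "question": question,
        "candidates": [draft, text_answer, image_answer],
    })


def make_stub_agents() -> Dict[str, Agent]:
    """Placeholder agents that only echo their role; swap in real model calls."""
    def stub(role: str) -> Agent:
        return lambda inputs: f"[{role}] {inputs['question']}"
    names = ["general", "critical", "text", "image", "summarizing"]
    return {name: stub(name) for name in names}


if __name__ == "__main__":
    ctx = RetrievedContext(text_chunks=["Revenue rose in 2023."],
                           page_images=["page_4.png"])
    print(answer_question("What happened to revenue?", ctx, make_stub_agents()))
```

Treating each agent as a plain callable keeps the orchestration independent of any particular model API, so an LLM-backed text agent or an LVLM-backed image agent could be substituted for the stubs without changing `answer_question`.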
